ISI: Challenges and Research Framework


The tragic events of September 11, 2001, and the subsequent anthrax 
scare had profound effects on many aspects of society. Terrorism has 
become the most significant threat to domestic security because of its 
potential to bring massive damage to the nation’s infrastructure and 
economy. In response to this challenge, federal authorities are actively 
implementing comprehensive strategies and measures to achieve the 
three objectives identified in the “National Strategy for Homeland 
Security” report (U.S. Office of Homeland Security, 2002): (1) preventing 
future terrorist attacks, (2) reducing the nation’s vulnerability, and (3) 
minimizing the damage and expediting recovery from attacks that occur. 
State and local law enforcement agencies, likewise, have become more 
vigilant about criminal activities that can threaten public safety and 
national security. 

Academics in the natural sciences, computational science, informa- 
tion science, social sciences, engineering, medicine, and many other 
fields have also been called upon to help enhance the government’s capa- 
bilities to fight terrorism and other crime. Science and technology have 
been identified in the “National Strategy for Homeland Security” report 
as the keys to winning the new counter-terrorism war (U.S. Office of
Homeland Security, 2002). In particular, it is believed that information 
technology and information management will play indispensable roles 
in making the nation safer (Cronin, 2005; Davies, 2002; National 
Research Council, 2002) by supporting intelligence and knowledge dis- 
covery through collecting, processing, analyzing, and utilizing terrorism- 
and crime-related data (Badiru, Karasz, & Holloway, 1988; Chen, 
Miranda, Zeng, Demchak, Schroeder, & Madhusudan, 2003; Chen, 
Moore, Zeng, & Leavitt, 2004). With access to high-quality intelligence, 
federal, state, and local authorities can make timely decisions to select 
effective strategies and tactics and to allocate appropriate resources to 
detect, prevent, and respond to future attacks. 

This chapter addresses issues regarding the development of informa- 
tion technologies in the intelligence and security domain. We propose a 
research framework with a primary focus on knowledge discovery from 
databases (KDD). After a comprehensive literature review of existing
technologies used in counter-terrorism and crime-fighting applications, 
we present a set of case studies to demonstrate how KDD and other tech- 
nologies can contribute to the critical objectives of national security. We 
also briefly discuss legal, ethical, and social issues related to the use of 
information technology for national security. 


Information Technology and National Security 


Six critical mission areas have been identified where information 
technology can contribute to accomplishing the three strategic national 
security objectives identified in the “National Strategy for Homeland 
Security” report (U.S. Office of Homeland Security, 2002): 


• Intelligence and warning. Although terrorism depends
on surprise to bring damage to targets (U.S. Office of 
Homeland Security, 2002), terrorist activities are 
neither random nor impossible to track. Terrorists 
must plan and prepare before the execution of an 
attack by selecting a target, recruiting and training 
operatives, acquiring financial support, and traveling 
to the country where the target is located (Sageman, 
2004). To avoid detection, they may hide their true 
identities and disguise attack-related activities. 
Similarly, criminals may use falsified identities during 
police contacts (Wang, Chen, & Atabakhsh, 2004). 
Although it is difficult, detecting potential terrorist 
attacks or crimes is possible with the help of 
information technology. By analyzing communication 
and activity patterns among terrorists and their 
contacts, detecting fake identities, and employing 
surveillance and monitoring techniques, intelligence 
and warning systems can provide critical alerts and 
timely warnings to prevent attacks or crimes from 
occurring. 


• Border and transportation. Terrorists enter a targeted
country by air, land, or sea. Criminals in narcotics 
rings travel across borders to purchase, transport, 
distribute, and sell drugs. Information such as 
travelers’ identities, images, fingerprints, and vehicles 
used is collected from customs, border, and immigration 
authorities on a daily basis. Such information can 
greatly improve the capabilities of counter-terrorism 
and crime-fighting agencies by creating a “smart 
border,” where information from multiple sources is 
integrated and analyzed to help detect or locate wanted 
terrorists or criminals. Information sharing and 
integration, collaboration and communication,
biometrics, and image and speech recognition will all be
greatly needed in creating smart borders.


• Domestic counter-terrorism. As terrorists may be
involved in local crimes, state and local law 
enforcement agencies also contribute by investigating 
and prosecuting crimes. Terrorism is regarded as a 
type of organized crime in which multiple actors coop- 
erate to carry out offenses. Information technologies 
that help unearth cooperative relationships among 
criminals and reveal their patterns of interaction 
would also be helpful for analyzing terrorism. 


• Protecting critical infrastructure. Roads, bridges, water
supply, and many other physical service systems are 
critical infrastructures and key national assets that 
may become the target of terrorist attacks because of 
their vulnerabilities (U.S. Office of Homeland Security, 
2002). Moreover, virtual infrastructures such as the 
Internet are also vulnerable to intrusions and insider 
threats (Lee & Stolfo, 1998). In addition to physical 
devices such as sensors and detectors, advanced 
technologies are needed to model the normal usage 
behaviors of such systems so that abnormalities and 
exceptions can be identified. Preemptive or reactive 
measures can be selected on the basis of the results to 
secure these assets against attacks. 


• Defending against catastrophic terrorism. Terrorist
attacks can cause devastating damage to a society 
through the use of chemical, biological, or radiological 
weapons. Biological attacks, for example, may cause 
contamination, outbreaks of infectious disease, and 
significant loss of life. Information systems that can 
efficiently and effectively collect, access, analyze, and 
report data about potentially catastrophic events can 
help agencies prevent, detect, respond to, and manage 
such attacks (Damianos, Ponte, Wohlever, Reeder, Day, 
Wilson, et al., 2002). 


• Emergency preparedness and response. In case of a
national emergency, prompt and effective responses 
are critical to damage containment and control. In 
addition to the systems that are designed to defend 
against catastrophes, information technologies that 
help formulate, experiment with, and optimize 
response plans (Lu, Huang, & Shekhar, 2003); train 
response professionals; and manage consequences are 
beneficial in the long run. Moreover, systems that
provide social and psychological support to the victims 
of terrorist attacks can also help society recover from 
disasters. 


Given the importance of information technology to national security, 
its development for counter-terrorism and crime-fighting applications 
is of the highest priority, despite the many associated problems and 
challenges. 


Problems and Challenges 


Intelligence and security agencies routinely gather large amounts of 
data from various sources. Processing and analyzing such data, however, 
have become increasingly difficult. Treating terrorism as a form of orga- 
nized crime allows us to categorize the challenges into three types: 


• Understanding characteristics of criminals and crimes.
Some crimes may be geographically diffused and 
temporally dispersed. For instance, transnational 
narcotics trafficking criminals often live in different 
countries, states, and cities. Drug distribution and 
sales occur in different places at different times. This 
is true of other forms of organized crime (e.g., 
terrorism, sex trafficking, labor racketeering). As a 
result, investigations must track and prosecute 
multiple offenders who commit criminal activities in 
different places at different times. Given the limited 
resources at the disposal of intelligence and security 
agencies, this can be difficult. Moreover, as computer 
and Internet technologies advance, criminals are 
committing various types of cybercrime under the 
guise of ordinary online transactions and 
communications. 


• Understanding characteristics of crime- and intelligence-related
data. A significant challenge is the information
stovepipe and overload resulting from diverse data 
sources, multiple data formats, and large data 
volumes. Unlike other professional disciplines such as 
marketing, finance, and medicine, in which data can 
be collected from particular sources (e.g., sales records, 
companies, patient medical histories), the intelligence 
and security domain does not have a well-defined set 
of data sources. Both authoritative information (e.g., 
crime incident reports, telephone records, financial 
statements, immigration and customs records) and open
source information (e.g., news stories, journal articles,
books, Web pages) need to be gathered for
investigative purposes. Data collected from these 
different sources often exist in different formats, 
ranging from structured database records to 
unstructured text, image, audio, and video files. 
Important information such as evidence of criminal 
associations may be available but buried in 
unstructured texts and difficult to access and retrieve. 
Moreover, as data volumes continue to grow, extracting 
valuable and credible intelligence and knowledge 
becomes more difficult. 


• Developing crime and intelligence analysis techniques.
Current research on the technologies for counter- 
terrorism and crime-fighting applications lacks a 
consistent framework to address the major challenges. 
Some information technologies, including data 
integration, data analysis, text mining, image and 
video processing, and evidence combination, have been 
identified as particularly helpful (National Research 
Council, 2002). However, the question of how to 
employ them in the intelligence and security domain 
remains unanswered. 


We believe that there is a pressing need to develop a science of 
“Intelligence and Security Informatics” (ISI) (Chen, Miranda, et al., 
2003; Chen, Moore, et al., 2004), with its main objective being the “devel- 
opment of advanced information technologies, systems, algorithms, and 
databases for national security related applications, through an inte- 
grated technological, organizational, and policy-based approach” (Chen, 
Miranda, et al., 2003, p. v). 

In comparing ISI with biomedical informatics, a young discipline 
addressing information management issues in biological and medical 
applications, we have found important similarities. In terms of data 
characteristics, they both face the information stovepipe and informa- 
tion overload problems; in terms of technology development, they both 
are at the exploratory stage of searching for new approaches, methods, 
and innovative use of existing techniques; in terms of scientific contri- 
butions, they both may add new insights and knowledge to fields such as 
computer science and decision science. Table 6.1 summarizes the simi- 
larities and differences between ISI and biomedical informatics. Most 
importantly, just as a consistent framework has emerged in biomedical 
informatics (Shortliffe & Blois, 2000), so ISI needs a framework to guide 
its research agenda. We believe that the knowledge discovery from data- 
bases (KDD) methodology, which has achieved significant success in 
other domains, including business, engineering, biology, and medicine, 
could be critical in addressing the challenges and problems facing ISI. 

Table 6.1 Analogies between ISI and biomedical informatics

                 Biomedical Informatics                  ISI

Domain-Specific  • Complexity and uncertainty            • Geographically diffused and
                   associated with organisms and           temporally dispersed organized
                   diseases                                crimes
                 • Critical decisions regarding patient  • Cybercrimes on the Internet
                   well-being and biomedical             • Critical decisions related to public
                   discoveries                             safety and homeland security

Data             • Information stovepipe and overload    • Information stovepipe and overload
                 • HL7 XML standard                      • Justice XML standard
                 • PHIN MS messaging                     • Criminal incident records
                 • Patient records, disease data,        • Multilingual intelligence open
                   medical images                          sources

Technology       • Ontologies and linguistic parsing     • Information integration
                 • Information integration               • Criminal network analysis
                 • Data and text mining                  • Data, text, and Web mining
                 • Medical decision-support systems      • Identity management and deception
                   and techniques                          detection

Methodology      KDD                                     KDD

Scientific       • Computer and information science,     • Computer and information science,
                   sociology, policy, legal                sociology, policy, legal
                 • Clinical medicine and biology         • Criminology, terrorism research

Practical        • Public health                         • Crime investigation and
                 • Patient well-being                      counter-terrorism
                 • Biomedical treatment and discovery    • National and homeland security


The ISI Framework


To address the data and technical challenges facing ISI, we present a 
research framework with a primary focus on KDD technologies. The 
framework is discussed in the context of types of crime and security 
implications. 

Crime is the commission of an act that is forbidden or the omission of 
a duty that is commanded by a public law, thus making the offender 
liable to punishment under that law. The greater the threat that a par- 
ticular crime poses to public safety, the more likely it is to be viewed as 
a national security concern. Some crimes, such as traffic violations, 
theft, and homicide, lie mainly in the jurisdiction of local law enforce- 
ment agencies. Other crimes need to be dealt with by both local law 
enforcement and national security authorities. Identity theft and fraud, 
for instance, are related to criminal identity management issues at both 
local and national levels. Criminals may escape arrest by using false 
identities; drug smugglers may enter the United States by holding counterfeit
passports or visas. Organized crime, such as terrorism and narcotics
trafficking, often diffuses geographically and temporally, resulting
in common security concerns across cities and states. Cybercrime can 
pose threats to public safety across multiple jurisdictions because of the 
nature of computer network technology. Table 6.2 summarizes the dif- 
ferent types of crimes, sorted by security level (Chen, Chung, Wu, Chau, 
& Qin, 2004). 

Table 6.2 Types of crime and security concerns

              Crime Types
Type          Local Law Enforcement Level                National Security Level

Traffic       Driving under the influence (DUI),
Violations    fatal/personal injury/property
              damage, traffic accident, road rage

Sex Crime     Sexual offenses, sexual assaults,          Transnational child pornography
              child molesting

Theft         Robbery, burglary, larceny, motor          Theft of national secrets or weapon
              vehicle theft, stolen property             information

Fraud         Forgery and counterfeiting, fraud,         Transnational money laundering, identity
              embezzlement, identity deception           fraud, transnational financial fraud

Property      Property crime (e.g., arson) on            Intentional destruction of or damage to
Crime         buildings, apartments                      national infrastructures and assets

Organized     Narcotic drug offenses (sales or           Transnational drug trafficking, terrorism
Crime         possession), gang-related offenses,        (bioterrorism, bombing, hijacking, etc.)
              organized prostitution

Violent       Criminal homicide, armed robbery,          Terrorism
Crime         aggravated assault, other assaults

Cybercrime    Internet fraud (e.g., credit card fraud,   Network intrusion/hacking, illegal trading,
              advance fee fraud, fraudulent Web          virus spreading, cyberpiracy,
              sites), theft of confidential              cyberpornography, cyberterrorism, theft of
              information                                confidential information


We believe that KDD techniques can play a central role in improving 
the counter-terrorism and crime-fighting capabilities of intelligence and 
security agencies by reducing cognitive and information overload. 
Knowledge discovery refers to nontrivial extraction of implicit, previ- 
ously unknown, and potentially useful knowledge from data. Knowledge 
discovery techniques promise easy, convenient, and practical exploration 
of very large collections of data for organizations and users, and have 
been applied in marketing, finance, manufacturing, biology, and many 
other domains (e.g., predicting consumer behavior, detecting credit card 
fraud, or clustering genes that have similar biological functions). 
Knowledge discovery usually consists of multiple stages, including data 
selection, data preprocessing, data transformation, data mining, and the 
interpretation and evaluation of patterns (Fayyad, Piatetsky-Shapiro, &
Smyth, 1996). Data mining plays a key role in extracting patterns from 
data. Traditional data mining techniques include association-rule min- 
ing, classification and prediction, cluster analysis, and outlier analysis 
(Han & Kamber, 2001). As natural language processing (NLP) research 
advances, text mining approaches that automatically extract, summa- 
rize, categorize, and translate text documents are also being widely used 
(Trybula, 1999). 

Many of these KDD technologies could be applied in ISI studies (Chen, 
Miranda, et al., 2003; Chen, Moore, et al., 2004). We categorize existing 
ISI technologies into six classes: information sharing and collaboration, 
crime association mining, crime classification and clustering, intelligence 
text mining, spatial and temporal crime pattern mining, and criminal 
network mining. These six classes are grounded in traditional knowledge 
discovery technologies, but include a few new approaches, such as spatial
and temporal crime pattern mining and criminal network analysis,
that are more relevant to counter-terrorism and crime investigation. 
Although information sharing and collaboration are not knowledge dis- 
covery per se, they help integrate, warehouse, and prepare data for 
knowledge discovery and thus are included in the framework. 

We present in Figure 6.1 our proposed research framework with the 
horizontal axis representing crime types and vertical axis the six classes 
of techniques (Chen, Chung, et al., 2004). The shaded regions on the 
chart show promising research areas, that is, certain classes of tech- 
niques are relevant to solving certain types of crime. Note that more 
serious crimes may require a more complete set of knowledge discovery 
techniques. For example, the investigation of terrorism may depend on 
criminal network analysis technology, which requires the use of other 
knowledge discovery techniques such as association mining and cluster- 
ing. An important observation about this framework is that the high- 
frequency occurrences and strong association patterns of severe and 
organized crime, such as terrorism and narcotics, present a unique 
opportunity and potentially high rewards for adopting a knowledge dis- 
covery framework. 


Figure 6.1 A knowledge discovery research framework for intelligence and
security informatics.


Caveats for ISI

Before we review the technical foundations and approaches, we want
to discuss briefly the legal and ethical caveats regarding crime and intel-
ligence research. The potential negative effects of intelligence gathering
and analysis on the privacy and civil liberties of the public have been
well publicized (Cook & Cook, 2003). Many laws, regulations, and agree- 
ments governing data collection, confidentiality, and reporting could 
influence directly the development and application of ISI technologies. 
We strongly recommend that intelligence and security agencies and ISI 
researchers be aware of these laws and regulations in their research 
efforts (Strickland, 2005). Moreover, we also suggest that a hypothesis- 
guided, evidence-based approach be used in crime and intelligence 
analysis research. That is, there should be probable and reasonable 
causes and evidence for targeting particular individuals or data sets for 
analysis. Proper investigative and legal procedures need to be strictly 
followed. It is neither ethical nor legal to “fish” for potential criminals 
from diverse and mixed crime-, intelligence-, and citizen-related data 
sources (Strickland, 2005). The well-publicized Defense Advanced
Research Projects Agency (DARPA) Total Information Awareness (TIA)
program and the Multi-State Anti-Terrorism Information Exchange 
(MATRIX) system, for example, were roundly criticized for their inap- 
propriate use of citizen data and unguided analysis technologies result- 
ing in the potential impairment of Americans’ civil liberties (American 
Civil Liberties Union, 2004). Many new and important privately and 
publicly funded research projects aim to address these privacy and civil 
liberties issues in the context of homeland security research. For exam- 
ple, the Electronic Frontier Foundation monitors limits placed on free- 
dom of expression as indicated by Web sites closed for national security 
reasons by government or Internet service providers (http://www.eff.org/ 
Privacy/Surveillance/Terrorism/antiterrorism_chill.html); the OpenNet 
Initiative is conducting a three-year study of Internet filtering in Saudi 
Arabia (http://www.opennetinitiative.net/studies/saudi). 


ISI: Technical Foundations and Approaches 


In this section, we review the technical foundations of ISI and the six 
classes of technologies and approaches specified in our research frame- 
work. We also summarize relevant past and ongoing research that 
addresses knowledge discovery in public safety and national security. 


Information Sharing and Collaboration 


Information sharing across jurisdictional boundaries of intelligence 
and security agencies has been identified as a key foundation of national 
security (U.S. Office of Homeland Security, 2002). However, sharing and 
integrating information from distributed, heterogeneous, and autono- 
mous data sources is a nontrivial task (Hasselbring, 2000; Rahm & 
Bernstein, 2001). In addition to legal and cultural issues regarding 
information sharing, it is often difficult to integrate and combine data 
that are organized in different schemas and stored in different database 
systems running on different hardware platforms and operating systems 
(Hasselbring, 2000). Other data integration problems include: (1) name
differences (same entity with different names), (2) mismatched domains 
(problems with units of measure or reference point), (3) missing data 
(incomplete data sources or different data available from different 
sources), and (4) object identification (no global ID values and no inter- 
database ID tables) (Chen & Rotem, 1998). 

Three approaches to data integration have been proposed: federation, 
warehousing, and mediation (Garcia-Molina, Ullman, & Widom, 2002). 
Database federation maintains data in their original, independent 
sources but provides a uniform data access mechanism (Buccella, 
Cechich, & Brisaboa, 2003; Haas, 2002). Data warehousing is an inte- 
grated system in which copies of data from different data sources are 
migrated and stored to provide uniform access. Data mediation relies on 
“wrappers” to translate and pass queries from multiple data sources. 
The wrappers are “transparent” to an application so that the multiple 
databases appear to be a single database. These techniques are not 
mutually exclusive and many hybrid approaches have been proposed 
(Jhingran, Mattos, & Pirahesh, 2002). 
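
The mediation idea can be illustrated with a minimal Python sketch. It is not taken from any system cited here. The two in-memory sources, their field names, and the `Wrapper` and `Mediator` classes are all hypothetical, chosen only to show how a wrapper can translate a query posed against a common schema into a source-specific lookup so that the application sees what looks like a single database.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical sources: same kind of record, different field names and casing.
POLICE_ROWS = [
    {"last_nm": "Smith", "first_nm": "John"},
    {"last_nm": "Jones", "first_nm": "Mary"},
]
BORDER_ROWS = [
    {"surname": "SMITH", "given": "JOHN"},
    {"surname": "LEE", "given": "ANN"},
]


@dataclass
class Wrapper:
    """Translates queries on the common schema into lookups on one source."""
    source: str
    rows: List[Dict[str, str]]
    field_map: Dict[str, str]  # common field name -> source field name

    def query(self, field: str, value: str) -> List[Dict[str, str]]:
        source_field = self.field_map[field]
        wanted = value.strip().lower()
        results = []
        for row in self.rows:
            if row[source_field].strip().lower() == wanted:
                # Translate the matching row back into the common schema.
                record = {common: row[src].title()
                          for common, src in self.field_map.items()}
                record["source"] = self.source
                results.append(record)
        return results


class Mediator:
    """Presents several wrapped sources as if they were one database."""

    def __init__(self, wrappers: List[Wrapper]) -> None:
        self.wrappers = wrappers

    def query(self, field: str, value: str) -> List[Dict[str, str]]:
        results: List[Dict[str, str]] = []
        for wrapper in self.wrappers:
            results.extend(wrapper.query(field, value))
        return results


mediator = Mediator([
    Wrapper("police", POLICE_ROWS, {"last_name": "last_nm", "first_name": "first_nm"}),
    Wrapper("border", BORDER_ROWS, {"last_name": "surname", "first_name": "given"}),
])

for record in mediator.query("last_name", "smith"):
    print(record)
```

In this sketch the data stay in their original sources and only the query is translated. A warehouse would instead copy the rows into one store ahead of time.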

All these techniques are dependent, to a great extent, on the match- 
ing between different databases. The task of database matching can be 
broadly divided into schema-level and instance-level matching (Lim, 
Srivastava, Prabhakar, & Richardson, 1996; Rahm & Bernstein, 2001).
Schema-level matching is performed by aligning semantically corre- 
sponding columns between two sources. Various schema elements such 
as attribute name, description, data type, and constraints may be used 
to generate a mapping between the two schemas (Rahm & Bernstein, 
2001). For example, prior studies have used linguistic matchers to find 
similar attribute names based on synonyms, common substrings, pro- 
nunciation, and Soundex codes (Newcombe, Kennedy, Axford, & James, 
1959) to match attributes from different databases (Bell & Sethi, 2001). 
Instance-level or entity-level matching connects records describing a 
particular object in one database to records describing the same object 
in another. Entity-level matching is frequently performed after schema- 
level matching is completed. Existing entity matching approaches 
include (1) key equivalence, (2) user-specified equivalence, (3) proba- 
bilistic key equivalence, (4) probabilistic attribute equivalence, and (5) 
heuristic rules (Lim et al., 1996). 
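
As a rough illustration of a linguistic matcher for schema-level matching, the Python sketch below pairs attribute names whose Soundex codes agree. It assumes the common American Soundex variant, and the two schemas and the `match_attributes` helper are hypothetical. A practical matcher would combine this with the synonym, substring, data type, and constraint evidence mentioned above.

```python
from typing import List, Tuple

# Consonant groups of American Soundex; vowels and y carry no code.
_GROUPS = {"1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r"}
_CODES = {ch: digit for digit, letters in _GROUPS.items() for ch in letters}


def soundex(name: str) -> str:
    """Return the four-character Soundex code of the letters in `name`."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    encoded = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            encoded += code
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (encoded + "000")[:4]


def match_attributes(schema_a: List[str], schema_b: List[str]) -> List[Tuple[str, str]]:
    """Pair attribute names from two schemas that share a Soundex code."""
    return [(a, b) for a in schema_a for b in schema_b
            if soundex(a) == soundex(b)]


# Hypothetical attribute names from two police databases.
schema_a = ["last_name", "first_name", "gender", "phone"]
schema_b = ["LastNm", "FirstNm", "Gendr", "Telephone"]

for pair in match_attributes(schema_a, schema_b):
    print(pair)
```

The pair `phone` and `Telephone` is not matched, which shows why phonetic codes alone are insufficient: synonyms and abbreviations that change the leading sound need dictionary or substring evidence.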

Some of these information integration approaches have been used in 
law enforcement and intelligence agencies for investigation support. The 
COPLINK Connect system (Chen, Schroeder, Hauck, Ridgeway, 
Atabakhsh, Gupta, et al., 2003) employed the database federation 
approach to achieve schema-level data integration. It provided a com- 
mon COPLINK schema and a one-stop-shop user interface to facilitate 
access to different data sources from multiple police departments. 
Evaluation results showed that COPLINK Connect outperformed
the Record Management System (RMS) of police data in system effectiveness,
ease of use, and interface design (Chen, Schroeder, et al., 2003).
Similarly, the Phoenix Police Department Reports (PPDR) is a Web-based,
federated intelligence system in which databases share a common
schema (Dolotov & Strickler, 2003). The bioterrorism surveillance
systems developed at the University of South Florida, on the other hand, 
used data warehouses to integrate historical and real-time surveillance 
data and incrementally incorporated data from diverse disease sources 
(Berndt, Bhat, Fisher, Hevner, & Studnicki, 2004; Berndt, Hevner, & 
Studnicki, 2003). A transnational information-sharing system developed 
at the University of Florida employed a data mediation approach (Kasad 
& Su, 2004). The system accessed different databases via a wrapper 
query processor, which tailored a user query into database-specific 
queries. This system was intended to enhance information sharing 
between immigration and border controls in multiple countries. 

Integrating data at the entity level has also been difficult. In addition
to existing key equivalence matching and heuristic consolidation 
approaches (Goldberg & Senator, 1998), use of the National Incident- 
Based Reporting System (NIBRS) (U.S. Federal Bureau of Investigation, 
1992), a crime incident classification standard, has been proposed to 
enhance data sharing among law enforcement agencies (Faggiani & 
McLaughlin, 1999; Schroeder, Xu, & Chen, 2003). In the Violent Crime 
Linkage Analysis System (ViCLAS) (Collins, Johnson, Choy, Davidson, 
& Mackay, 1998), data collection and encoding standards were used to 
capture more than 100 behavioral characteristics of offenders in serial 
violent crimes in order to address the problem of entity-level matching. 

Information sharing has also been undertaken in intelligence and
security agencies through cross-jurisdictional collaborative systems. The 
COPLINK Agent ran on top of the COPLINK Connect system (Chen, 
Schroeder, et al., 2003) and linked crime investigators who were work- 
ing on related cases at different units to enhance collaborations (Zeng, 
Qin, Huang, & Chen, 2003). It employed collaborative filtering 
approaches (Goldberg, Nichols, Oki, & Terry, 1992), which have been 
widely studied in commercial recommender systems, to identify law 
enforcement users who had similar search histories. Similar search his- 
tories might indicate that these users had similar information needs and 
thus were working on related crime cases. When one user searched for 
information about a crime or a suspect, the system would alert other 
users who worked on related cases so that these users could collaborate 
and share their information through other communication channels. 
The FALCON system offered similar monitoring and alerting function- 
ality (M. Brown, 1998). Its collaboration capability, however, was rela- 
tively limited. The JNET system (http://www.pajnet.state.pa.us/pajnet/ 
site/default.asp) also provides an alerting capability that immediately 
notifies relevant agencies via pager or e-mail when a wanted person is 
found or arrested by other agencies. Research has also been performed 
to model mathematically collaboration processes across law enforcement 
and intelligence jurisdictions in order to improve work productivity 
(Raghu, Ramesh, & Whinston, 2003; Zhao, Bi, & Chen, 2003). Although 
information sharing and collaboration are not knowledge discovery per
se, they prepare data for important subsequent analyses. 
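
The idea of finding users with similar search histories can be sketched in a few lines of Python. The text does not specify which similarity measure COPLINK Agent used, so the Jaccard coefficient, the 0.5 threshold, the user names, and the search terms below are all assumptions made for illustration.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Share of search terms two users have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def related_users(histories: Dict[str, Set[str]],
                  threshold: float = 0.5) -> List[Tuple[str, str, float]]:
    """Return user pairs whose search histories overlap at or above threshold."""
    pairs = []
    for u, v in combinations(sorted(histories), 2):
        score = jaccard(histories[u], histories[v])
        if score >= threshold:
            pairs.append((u, v, round(score, 2)))
    return pairs


# Hypothetical search histories of three investigators.
histories = {
    "det_a": {"suspect:J.Doe", "vehicle:ABC123", "address:12 Elm St"},
    "det_b": {"suspect:J.Doe", "vehicle:ABC123", "phone:555-0100"},
    "det_c": {"suspect:R.Roe"},
}

for u, v, score in related_users(histories):
    print(f"{u} and {v} may be working on related cases (similarity {score})")
```

A pair that clears the threshold would be a candidate for the kind of alert described above. Choosing the threshold is a trade-off between missed collaborations and spurious alerts.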


Crime Association Mining 


Finding associations among data items is an important topic in 
knowledge discovery research. One of the most widely studied approaches is
association-rule mining, a process for discovering frequently occurring 
item sets in a database. Association-rule mining is often used in market 
basket analysis where the objective is to find which products are bought 
with which other products (Agrawal, Imielinski, & Swami, 1993; 
Mannila, Toivonen, & Inkeri, 1994; Silverstein, Brin, & Motwani, 1998). 
An association is expressed as a rule X ⇒ Y, indicating that item set X
and item set Y occur together in the same transaction (Agrawal et al.,
1993). Each rule is evaluated using two probability measures, support
and confidence, where support is defined as prob(X ∩ Y) and confidence as
prob(X ∩ Y)/prob(X). For example, “diaper ⇒ milk with 60 percent sup-
port and 90 percent confidence” means that 60 percent of customers buy
both diapers and milk in the same transaction and that 90 percent of the
customers who buy diapers tend to buy milk at the same time.
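
The two measures can be computed directly from a list of transactions. The following Python sketch uses a small, made-up set of five market-basket transactions. Support is the fraction of all transactions that contain both item sets, and confidence is the fraction of the transactions containing the first item set that also contain the second. The function names are illustrative only.

```python
from typing import Iterable, List, Set


def count_containing(transactions: List[Set[str]], items: Iterable[str]) -> int:
    """Number of transactions that contain every item in `items`."""
    wanted = set(items)
    return sum(1 for t in transactions if wanted <= t)


def support(transactions: List[Set[str]], x: Set[str], y: Set[str]) -> float:
    """Fraction of all transactions containing both X and Y."""
    return count_containing(transactions, x | y) / len(transactions)


def confidence(transactions: List[Set[str]], x: Set[str], y: Set[str]) -> float:
    """Fraction of the transactions containing X that also contain Y."""
    x_count = count_containing(transactions, x)
    if x_count == 0:
        return 0.0
    return count_containing(transactions, x | y) / x_count


# Made-up transactions for illustration.
transactions = [
    {"diaper", "milk"},
    {"diaper", "milk", "bread"},
    {"diaper", "milk"},
    {"diaper"},
    {"bread"},
]

print(support(transactions, {"diaper"}, {"milk"}))     # 3 of 5 transactions
print(confidence(transactions, {"diaper"}, {"milk"}))  # 3 of the 4 diaper transactions
```

In this toy data the rule has 60 percent support and 75 percent confidence. A mining algorithm would keep only rules whose support and confidence exceed user-chosen minimums.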

In the intelligence and security domain, spatial association-rule min- 
ing (Koperski & Han, 1995) has been proposed to extract cause-effect 
relations among geographically referenced crime data to identify envi- 
ronmental factors that attract crime (Estivill-Castro & Lee, 2001). 
Moreover, the research on association mining is not limited to associa- 
tion-rule mining but covers the extraction of a wide variety of relation- 
ships among crime data items. Crime association mining techniques can 
include incident association mining and entity association mining (Lin & 
Brown, 2003). 

The purpose of incident association mining is to find crimes that 
might have been committed by the same offender; unsolved crimes are 
linked to solved crimes to identify the suspect. This technique is often 
used to solve serial sexual offenses and serial homicides. However, find- 
ing associated crime incidents can be time-consuming if it is performed 
manually. It is estimated that pairwise, manual comparisons on just a 
few hundred crime incidents would take more than one million human 
hours (Brown & Hagen, 2002). When the number of crime incidents is 
large, manual identification of associations between crimes is prohibi- 
tively expensive. Two approaches, similarity-based and outlier-based, 
have been developed for incident association mining. For example, the 
Violent Criminal Apprehension Program (ViCAP) identifies similar fea- 
tures or traits of violent crimes such as homicides to detect serial offend- 
ers (Icove, 1986). 

Similarity-based methods detect associations between crime inci- 
dents by comparing features such as spatial locations of the incidents 
and the offender’s modus operandi (MO), often regarded as a criminal’s 
“behavioral signature” (O’Hara & O’Hara, 1980). Expert systems relying 
on decision rules acquired from domain experts used to be a common 
approach to associating crime incidents (Badiru et al., 1988; Bowen, 
1994; Brahan, Lam, Chan, & Leung, 1998). However, as the collection of 
human decision rules requires considerable knowledge engineering 
effort and the rules collected are often hard to update, the expert system 
approach has been replaced by more automated approaches. Brown and 
Hagen (2002) developed a total similarity measure between two crime 
records as a weighted sum of similarities of various crime features. For 
features that take on categorical values (such as an offender’s eye color), 
Brown and Hagen developed a similarity table based on heuristics that specified the 
similarity level for each pair of categorical values. Evaluation showed 
that this approach enhanced both accuracy and efficiency for associating 
crime records. Similarly, Wang, Lin, Shieh, and Deng (2003) proposed 
measuring similarity between a new crime incident and existing criminal 
information stored in police databases by representing the new incident 
as a query and existing criminal information as vectors in a vector space. The 
vector space model is widely employed in information retrieval applica- 
tions; various similarity measures such as the Jaccard function 
(Rasmussen, 1992) could be used. 
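
The weighted-sum idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the measure published by Brown and Hagen: the feature names, the weights, the eye-color table, and the 10 km distance cutoff are all hypothetical values chosen for the example. The Jaccard coefficient is used here for the set-valued MO feature.

```python
import math
from typing import Dict, FrozenSet, Set

# Hypothetical heuristic table: similarity for pairs of differing eye colors.
EYE_COLOR_TABLE: Dict[FrozenSet[str], float] = {
    frozenset({"brown", "hazel"}): 0.6,
    frozenset({"blue", "green"}): 0.5,
}

# Hypothetical feature weights; they sum to 1 so the total stays in [0, 1].
WEIGHTS = {"eye_color": 0.2, "mo": 0.5, "location": 0.3}


def categorical_similarity(a: str, b: str,
                           table: Dict[FrozenSet[str], float]) -> float:
    """Look up the similarity of two categorical values; equal values score 1."""
    if a == b:
        return 1.0
    return table.get(frozenset({a, b}), 0.0)


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard coefficient: shared elements divided by all distinct elements."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def location_similarity(p, q, max_km: float = 10.0) -> float:
    """Linear decay with distance; the 10 km cutoff is an arbitrary assumption."""
    km_apart = math.hypot(p[0] - q[0], p[1] - q[1])
    return max(0.0, 1.0 - km_apart / max_km)


def total_similarity(r1: dict, r2: dict) -> float:
    """Weighted sum of per-feature similarities between two crime records."""
    scores = {
        "eye_color": categorical_similarity(
            r1["eye_color"], r2["eye_color"], EYE_COLOR_TABLE),
        "mo": jaccard(r1["mo"], r2["mo"]),
        "location": location_similarity(r1["xy_km"], r2["xy_km"]),
    }
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Two invented records for illustration.
solved = {"eye_color": "brown", "mo": {"forced entry", "night", "gloves"},
          "xy_km": (0.0, 0.0)}
unsolved = {"eye_color": "hazel", "mo": {"forced entry", "night"},
            "xy_km": (1.2, 1.6)}
print(round(total_similarity(solved, unsolved), 3))
```

In practice the weights and the table would have to be elicited from investigators or tuned against known linked cases; the values above only show how the pieces combine.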

Unlike similarity-based methods, which identify associations based 
on a number of crime features, the outlier-based method focuses only on 
the distinctive features of a crime (Lin & Brown, 2003). Imagine a series 
of robberies in which a Japanese sword was used as the weapon. 
Because a Japanese sword is a very uncommon weapon, unlike, say, a 
shotgun, it is probable that this series of robberies was committed by the 
same offender. Based on this outlier concept, crime investigators need 
first to cluster incidents into cells and then use an outlier score function 
to measure the distinctiveness of the incidents in a specific cell. If the 
outlier score of a cell is larger than a threshold value, the incidents con- 
tained in the cell are assumed to be associated and committed by the 
same offender. Evaluation has shown that the outlier-based method is 
more effective than the similarity-based method proposed in Brown and 
Hagen (2002). 
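
The two steps described above (grouping incidents into cells, then scoring each cell) might look like the following sketch. The text does not give the outlier score function, so the one used here, the summed negative log of each feature value's relative frequency, is an assumption made purely for illustration, as are the record layout and the threshold.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Sequence, Tuple


def score_cells(incidents: List[dict], features: Sequence[str]):
    """Group incidents into cells and score how unusual each cell is."""
    n = len(incidents)
    freq = {f: Counter(inc[f] for inc in incidents) for f in features}
    cells: Dict[Tuple, List[int]] = defaultdict(list)
    for inc in incidents:
        cells[tuple(inc[f] for f in features)].append(inc["id"])
    # Assumed score: sum of -log(relative frequency) over the cell's values,
    # so rare values (a Japanese sword) score high and common ones low.
    scores = {
        key: sum(-math.log(freq[f][v] / n) for f, v in zip(features, key))
        for key in cells
    }
    return dict(cells), scores


def associated_incidents(incidents: List[dict], features: Sequence[str],
                         threshold: float) -> Dict[Tuple, List[int]]:
    """Return cells that score above the threshold and hold 2+ incidents."""
    cells, scores = score_cells(incidents, features)
    return {key: ids for key, ids in cells.items()
            if scores[key] > threshold and len(ids) > 1}


# Hypothetical robbery records, invented for illustration.
robberies = (
    [{"id": i, "weapon": "japanese sword"} for i in (1, 2)]
    + [{"id": i, "weapon": "shotgun"} for i in range(3, 11)]
)
print(associated_incidents(robberies, ["weapon"], threshold=1.0))
```

With these ten records only the two sword robberies form a cell that exceeds the threshold, which mirrors the intuition in the example above.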

The task of finding and charting associations between crime entities 
such as persons, weapons, and organizations is often referred to as 
entity association mining (Lin & Brown, 2003) or link analysis (Sparrow, 
1991) in law enforcement. The purpose is to find out whether crime enti- 
ties that appear to be unrelated at the surface are actually linked, and 
if so, how. Law enforcement officers and criminal investigators through- 
out the world have long used link analysis to search for and analyze 
relationships among criminals. For example, the Federal Bureau of 
Investigation (FBI) used link analysis in the investigation of the 
Oklahoma City bombing case and the Unabomber case to look for crim- 
inal associations and investigative leads (Schroeder et al., 2003). 
Although link analysis helps trace criminals through chains of relations, 
manually identifying and detecting criminal relations from large 
amounts of criminal-justice data is very time-consuming. 




Three types of automated link analysis approaches have been sug- 
gested: heuristic-based, statistically-based, and template-based. 
Heuristic-based approaches rely on decision rules used by domain 
experts to determine whether two entities in question are related. For 
example, Goldberg and Senator (1998) suggested that links or associa- 
tions between individuals in financial transactions be created based on 
a set of heuristics, such as whether the individuals had shared 
addresses, shared bank accounts, or related transactions. This tech- 
nique has been employed by the FinCEN system of the U.S. Department 
of the Treasury to detect money laundering transactions and activities 
(Goldberg & Senator, 1998; Goldberg & Wong, 1998). The COPLINK 
Detect system (Hauck, Atabakhsh, Ongvasith, Gupta, & Chen, 2002) 
employed a statistically based approach called Concept Space (Chen & 
Lynch, 1992). This approach measures the weighted co-occurrence asso- 
ciations between records of entities (persons, organizations, vehicles, 
and locations) stored in crime databases. An association exists between 
a pair of entities if they appear together in the same criminal incident. 
The more frequently they occur together, the stronger the association. 
Zhang, Salerno, and Yu (2003) proposed to use a fuzzy resemblance func- 
tion to calculate the correlation between two individuals’ past financial 
transactions in order to detect associations between the individuals who 
might have been involved in a specific money-laundering crime. If the 
correlation between two individuals is higher than a threshold value, 
these two individuals are regarded as being related. The template-based 
approach has been used primarily to identify associations between enti- 
ties extracted from textual documents, such as police report narratives. 
Lee (1998) developed a template-based technique using relation-specify- 
ing words and phrases. For example, the phrase “member of” indicates 
an entity-entity association between an individual and an organization. 
Coady (1985) proposed to use the PROLOG language to derive rules of 
entity associations automatically from text data and use the rules to 
detect associations in similar documents. Template-based approaches 
rely heavily on a fixed set of predefined patterns and rules, and thus 
may have limited application scope. 
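
The co-occurrence idea behind the statistically based approach can be illustrated with a short sketch. It is a simplification and not the Concept Space algorithm itself: it counts how often two entities appear in the same incident and derives one simple asymmetric weight. The entity names and function names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import List, Set


def cooccurrence_counts(incidents: List[Set[str]]) -> Counter:
    """Count, for every pair of entities, the incidents in which both appear."""
    counts: Counter = Counter()
    for entities in incidents:
        for a, b in combinations(sorted(entities), 2):
            counts[(a, b)] += 1
    return counts


def relative_weight(incidents: List[Set[str]], a: str, b: str) -> float:
    """Share of a's incidents in which b also appears (asymmetric in a and b)."""
    with_a = [entities for entities in incidents if a in entities]
    if not with_a:
        return 0.0
    return sum(1 for entities in with_a if b in entities) / len(with_a)


# Hypothetical incident records, each listing the entities involved.
incidents = [
    {"Person A", "Vehicle V"},
    {"Person A", "Person B", "Vehicle V"},
    {"Person B", "Location L"},
]

counts = cooccurrence_counts(incidents)
print(counts[("Person A", "Vehicle V")])                   # appear together twice
print(relative_weight(incidents, "Person B", "Vehicle V"))  # half of B's incidents
```

Pairs are stored with their names in sorted order, so a lookup must use the same order. A threshold on either quantity would then decide which links are drawn.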


Crime Classification and Clustering 


Classification is the process of mapping data items into one of sev- 
eral predefined categories based on attribute values of the items 
(Hand, 1981; Weiss & Kulikowski, 1991). Examples of classification 
applications include fraud detection (Chan & Stolfo, 1998), computer 
and network intrusion detection (Lee & Stolfo, 1998), bank failure pre- 
diction (Sarkar & Sriram, 2001), and image categorization (Fayyad, 
Djorgovski, & Weir, 1996). Classification is a type of supervised learning 
that consists of a training stage and a testing stage. Accordingly, 
the dataset is divided into a training set and a testing set. The classi- 
fier is designed to “learn” from the training set classification models 
governing the membership of data items. The accuracy of the classifier 
is assessed using the testing set. 
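
The training and testing stages can be sketched with a deliberately simple classifier. The nearest-centroid model, the deterministic split, and the transaction data below are all assumptions chosen to keep the example short; they stand in for whichever classifier and sampling scheme a real study would use.

```python
from typing import Dict, List, Sequence, Tuple

Row = Tuple[Sequence[float], str]  # (attribute values, class label)


def split(data: List[Row], test_every: int = 3) -> Tuple[List[Row], List[Row]]:
    """Hold out every `test_every`-th row for testing (deterministic for clarity)."""
    train = [row for i, row in enumerate(data) if i % test_every != 0]
    test = [row for i, row in enumerate(data) if i % test_every == 0]
    return train, test


def fit_centroids(train: List[Row]) -> Dict[str, List[float]]:
    """Training stage: learn one mean attribute vector per class."""
    grouped: Dict[str, List[Sequence[float]]] = {}
    for features, label in train:
        grouped.setdefault(label, []).append(features)
    return {label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in grouped.items()}


def predict(centroids: Dict[str, List[float]], features: Sequence[float]) -> str:
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    return min(centroids, key=lambda label: sum(
        (a - b) ** 2 for a, b in zip(features, centroids[label])))


def accuracy(centroids: Dict[str, List[float]], test: List[Row]) -> float:
    """Testing stage: share of held-out rows classified correctly."""
    return sum(1 for f, label in test if predict(centroids, f) == label) / len(test)


# Hypothetical transactions: (amount, distance from home), invented data.
data: List[Row] = [
    ((20.0, 1.0), "legit"), ((900.0, 9.0), "fraud"), ((35.0, 2.0), "legit"),
    ((950.0, 8.0), "fraud"), ((25.0, 1.0), "legit"), ((870.0, 7.0), "fraud"),
    ((40.0, 3.0), "legit"), ((990.0, 9.0), "fraud"),
]
train, test = split(data)
model = fit_centroids(train)
print(accuracy(model, test))
```

Because the two invented classes are widely separated, the held-out rows are all classified correctly; real data would rarely be so clean, and a random or cross-validated split would normally replace the fixed one used here.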

Discriminant analysis (Eisenbeis & Avery, 1972), Bayesian models 
(Duda & Hart, 1973; Heckerman, 1995), decision trees (Quinlan, 1986, 
1993), artificial neural networks (Rumelhart, Hinton, & Williams, 
1986), and support vector machines (SVM) (Vapnik, 1995) are widely 
used classification techniques. In discriminant analysis the class mem- 
bership of a data item is modeled as a function of the item’s attribute 
values. Through regression analysis a class membership discriminant 
function can be obtained and used to classify new data items. Bayesian 
classifiers assume that all data attributes are conditionally indepen- 
dent, given the class membership outcome. The task is to learn the con- 
ditional probabilities among the attributes, given the class membership 
outcome. The learned model is then used to predict the class member- 
ship of new data items based on their attribute values. Decision tree 
classifiers organize decision rules learned from training data in the form 
of a tree. Algorithms such as ID3 (Quinlan, 1986, 1993) and C4.5 
(Quinlan, 1993) are popular decision tree classifiers. An artificial neural 
network consists of interconnected nodes to imitate the functioning of 
neurons and synapses of human brains. It usually contains an input 
layer with nodes taking on the attribute values of data items and the 
output layer with nodes representing class membership labels. Neural 
networks learn and encode knowledge through connection weights. SVM 
is a novel learning classifier based on the Structural Risk Minimization 
principle from computational learning theory. SVM is capable of han- 
dling millions of inputs and does not require feature selection 
(Cristianini & Shawe-Taylor, 2000). Each of these classification tech- 
niques has its advantages and disadvantages in terms of accuracy, effi- 
ciency, and interpretability. Researchers have also proposed hybrid 
approaches to combine these techniques (Kumar & Olmeda, 1999). 
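
To make the Bayesian case concrete, the sketch below implements a small naive Bayes classifier for categorical attributes. It follows the conditional-independence assumption described above; the add-one (Laplace) smoothing, the class name `NaiveBayes`, and the training rows are assumptions added for this example.

```python
import math
from collections import Counter, defaultdict
from typing import List, Sequence


class NaiveBayes:
    """Categorical naive Bayes with add-one smoothing (illustrative sketch)."""

    def fit(self, rows: List[Sequence[str]], labels: List[str]) -> "NaiveBayes":
        self.class_counts = Counter(labels)
        self.value_counts = defaultdict(Counter)  # (class, attribute) -> values
        self.values = defaultdict(set)            # attribute -> values seen
        for row, label in zip(rows, labels):
            for i, value in enumerate(row):
                self.value_counts[(label, i)][value] += 1
                self.values[i].add(value)
        return self

    def log_posterior(self, row: Sequence[str], label: str) -> float:
        """log P(class) plus the sum of log P(attribute value | class)."""
        total = sum(self.class_counts.values())
        logp = math.log(self.class_counts[label] / total)
        for i, value in enumerate(row):
            count = self.value_counts[(label, i)][value]
            denom = self.class_counts[label] + len(self.values[i])
            logp += math.log((count + 1) / denom)
        return logp

    def predict(self, row: Sequence[str]) -> str:
        return max(self.class_counts, key=lambda c: self.log_posterior(row, c))


# Hypothetical training data: (card origin, time of day) -> label.
rows = [("foreign", "night"), ("foreign", "night"),
        ("domestic", "day"), ("domestic", "day"), ("domestic", "night")]
labels = ["fraud", "fraud", "legit", "legit", "legit"]

clf = NaiveBayes().fit(rows, labels)
print(clf.predict(("foreign", "night")), clf.predict(("domestic", "day")))
```

Working in log space avoids numerical underflow when many attributes are multiplied together, and the smoothing keeps an unseen attribute value from forcing a probability of zero.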

Several of these techniques have been applied in the intelligence and 
security domain to detect financial fraud and computer network intru- 
sion. For example, in order to identify fraudulent financial transactions, 
Aleskerov, Freisleben, and Rao (1997) employed neural networks to 
detect anomalies in customers’ credit card transactions based on their 
transaction histories. Hassibi (2000) employed a feed-forward back-prop- 
agation neural network to compute the probability that a given transac- 
tion was fraudulent. Two types of intrusion detection, misuse detection 
and anomaly detection, have been studied in computer network security 
applications (Lee & Stolfo, 1998). Misuse detection identifies attacks by 
matching them against previously known attack patterns or signatures. 
Anomaly detection, on the other hand, identifies abnormal user behav- 
iors based on historical data. Lee and Stolfo (1998) employed decision 
rule induction approaches to classify sendmail system call traces into 
normal or abnormal traces. Ryan, Lin, and Miikkulainen (1998) developed 
a neural network-based intrusion detection system to detect unusual 
user activity based on the patterns of users’ past system command usage. 
Stolfo, Hershkop, Wang, Nimeskern, and Hu (2003) applied Bayesian 
classifiers to distinguish between normal e-mail and spam e-mail. 
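
The difference between the two detection styles can be shown with a toy sketch. Neither function reproduces the rule-induction or neural network methods of the studies cited above; the signature, the command names, and the scoring rule are hypothetical and serve only to contrast matching against known patterns with measuring deviation from a user's history.

```python
from typing import List, Sequence, Tuple

# Hypothetical attack signature: a contiguous sequence of system calls.
KNOWN_SIGNATURES: List[Tuple[str, ...]] = [("open", "setuid", "exec")]


def misuse_detect(trace: Sequence[str],
                  signatures: List[Tuple[str, ...]]) -> bool:
    """Misuse detection: does the trace contain any known signature?"""
    for signature in signatures:
        width = len(signature)
        for start in range(len(trace) - width + 1):
            if tuple(trace[start:start + width]) == signature:
                return True
    return False


def anomaly_score(history: Sequence[str], session: Sequence[str]) -> float:
    """Anomaly detection: share of session commands absent from the history."""
    if not session:
        return 0.0
    seen = set(history)
    return sum(1 for command in session if command not in seen) / len(session)


print(misuse_detect(["read", "open", "setuid", "exec", "close"], KNOWN_SIGNATURES))
print(anomaly_score(["ls", "cd", "vim"], ["ls", "nc", "chmod", "cd"]))
```

The first function can only recognize attacks it already has a signature for, whereas the second can flag novel behavior but will also flag harmless activity that a user has simply never performed before.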

Unlike classification, clustering is a type of unsupervised learning. It 
groups similar data items into clusters without knowing their class mem- 
bership. The basic principle is to maximize intra-cluster similarity while 
minimizing inter-cluster similarity (Jain, Murty, & Flynn, 1999). 
Clustering has been used in a variety of applications including image seg- 
mentation (Jain & Flynn, 1996), gene clustering (Eisen, Spellman, Brown, 
& Botstein, 1998), and document categorization (Chen, Houston, Sewell, 
& Schatz, 1998; Chen, Schuffels, & Orwig, 1996). Various clustering meth- 
ods have been developed, including hierarchical approaches, such as com- 
plete-link algorithms (Defays, 1977), partitional approaches, such as 
k-means (Anderberg, 1973; Kohonen, 1995), and Self-Organizing Maps 
(SOM) (Kohonen, 1995). These clustering methods group data items based 
on different criteria and may not generate the same clustering results. 
Hierarchical clustering groups data items into a series of nested clusters 
and generates a tree-like dendrogram in which each node represents a 
merging of clusters. Partitional clustering algorithms generate only one 
partition level rather than nested clusters. Partitional clustering is more 
efficient and scalable for large datasets than hierarchical clustering, but 
has difficulty determining the appropriate number of clusters (Jain et al., 
1999). In contrast to the hierarchical and partitional clustering that relies 
on the similarity or proximity measures between data items, SOM is a 
neural network-based approach that directly projects multivariate data 
items onto two-dimensional maps. SOM can be used for clustering and 
visualizing data items and groups (Chen, Schuffels, et al., 1996). 
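
A partitional method such as k-means can be sketched in a few lines. The version below is the standard assign-and-update iteration (Lloyd's algorithm); the starting centroids are passed in explicitly so the run is reproducible, and the points are invented. Note that the caller must choose the number of clusters by supplying that many starting centroids.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def sqdist(p: Sequence[float], q: Sequence[float]) -> float:
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points: List[Point], initial_centroids: List[Point],
           max_iter: int = 100) -> Tuple[List[Point], List[int]]:
    """Alternate between assigning points and recomputing centroids."""
    centroids = [tuple(c) for c in initial_centroids]
    assignment: List[int] = []
    for _ in range(max_iter):
        new_assignment = [
            min(range(len(centroids)), key=lambda j: sqdist(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the partition is stable
        assignment = new_assignment
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignment


# Hypothetical incident coordinates forming two obvious groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, assignment = kmeans(points, [(0, 0), (10, 10)])
print(assignment)
```

Different starting centroids can lead to different final partitions, which is one reason the methods discussed above may not produce the same clustering results.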

The use of clustering methods in the law enforcement and security 
domains can be categorized into two types: crime incident clustering and 
criminal clustering. The purpose of crime incident clustering is to find a 
set of similar crime incidents based on an offender’s behavioral traits or 
a geographical area with a high concentration of certain types of crimes. 
For example, Adderley and Musgrove (2001) employed the SOM 
approach to cluster sexual attack crimes based on a number of offender 
MO attributes (e.g., the precaution methods taken and the verbal 
themes during the crime) in order to identify serial sexual offenders. The 
clusters found were used to form offender profiles containing MO and 
other information such as offender motives and racial preferences when 
choosing victims. Similarly, Kangas, Terrones, Keppel, and La Moria 
(2003) employed the SOM method to group crime incidents in order to 
identify serial murderers and sexual offenders. D. Brown (1998) proposed 
k-means and the nearest neighbor approach to clustering spatial 
data of crimes to find “hot spot” areas in a city. Spatial clustering meth- 
ods are often used in “hot spot analysis,” which will be reviewed in detail 
in the section on spatial and temporal mining. 

Criminal clustering is often used to identify groups of criminals who 
are closely related. Instead of using similarity measures, this type of 
clustering relies on relational strength that measures the intensity and 
frequency of relationships between offenders. Stolfo et al. (2003) proposed 
grouping e-mail users who frequently communicated with each 
other into clusters so that unusual e-mail behavior that violated the 
group communication patterns could be detected. Offender clustering is 
more often used in criminal network analysis, which will be reviewed in 
detail in the section with that title. 


Intelligence Text Mining 


A large amount of intelligence- and security-related data is represented 
in text form such as police narrative reports, court transcripts, news sto- 
ries, and Web articles. Valuable information in such texts is often difficult 
to retrieve, access, and use for the purposes of criminal investigation and 
counter-terrorism. It is desirable to mine the text data automatically in 
order to discover valuable knowledge about criminal or terrorism activities. 

Text mining has attracted increasing attention in recent years as nat- 
ural language processing capabilities advance (Chen, 2001). An important 
task of text mining is information extraction, a process of identifying and 
extracting from free text select types of information such as entities, rela- 
tionships, and events (Grishman, 2003). The most widely studied infor- 
mation extraction subfield is named entity extraction. It helps to 
automatically identify from text documents the names of entities of inter- 
est, such as persons (e.g., “John Doe”), locations (e.g., “Washington, DC”), 
and organizations (e.g., “National Science Foundation”). It has also been 
extended to identify other text patterns, such as dates, times, number 
expressions, dollar amounts, e-mail addresses, and Web addresses 
(URLs). The Message Understanding Conference (MUC) series has served 
as the major forum for researchers in this area to compare the perfor- 
mance of their entity extraction approaches (Chinchor, 1998). 

Four major named-entity extraction approaches have been proposed: 
lexical lookup, rule-based, statistical models, and machine learning. 


• Lexical lookup. Most research systems maintain 
hand-crafted lexicons that contain lists of popular 
names for entities of interest, such as all registered 
organizational names in the U.S. and all personal 
surnames obtained from government census data. 
These systems work by looking up phrases in texts 
that match the items specified in their lexicons (e.g., 
Borthwick, Sterling, Agichtein, & Grishman, 1998). 


• Rule-based. Rule-based systems rely on hand-crafted 
rules to identify named entities. The rules may be 
structural, contextual, or lexical (Krupka & Hausman, 
1998). An example rule would look like the following: 


capitalized last name + , + capitalized first name → person name 

Although such human-created rules are usually of high 
quality, this approach may not be easy to apply to 
other entity types. 


• Statistical models. Such systems often use statistical 
models to identify occurrences of certain cues of 
particular patterns for entities in texts. A training data 
set is needed for a system to acquire the statistics. The 
statistical language model reported in Witten, Bray, 
Mahoui, and Teahan (1999) is an example of such a 
system. It uses the Prediction by Partial Matching 
(PPM) model to extract entities from text based on 
conditional probability distributions of characters. The 
probability of occurrence of later characters in a word 
or phrase depends on the occurrence of preceding 
characters; for example, “12Jan2005” in a newsletter 
can be correctly identified as a time phrase using this 
model. 


• Machine learning. This type of system relies on 
machine learning algorithms rather than human- 
created rules to extract knowledge or identify patterns 
from textual data. Examples of machine learning 
algorithms used in entity extraction include neural 
networks, decision trees (Baluja, Mittal, & 
Sukthankar, 1999), Hidden Markov Models (Miller, 
Crystal, Fox, Ramshaw, Schwartz, Stone, et al., 1998), 
and entropy maximization (Borthwick et al., 1998). 
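
The first two approaches in the list above are simple enough to sketch. The example below combines a tiny lexicon lookup with a regular expression that encodes the sample rule for person names. The lexicon contents, the function name `extract_entities`, and the test sentence are assumptions for illustration; production systems use far larger lexicons and rule sets.

```python
import re
from typing import List, Tuple

# Hypothetical lexicon of organization names.
ORGANIZATIONS = {
    "National Science Foundation",
    "Federal Bureau of Investigation",
}

# Rule: capitalized last name + "," + capitalized first name -> person name.
PERSON_RULE = re.compile(r"\b([A-Z][a-z]+), ([A-Z][a-z]+)\b")


def extract_entities(text: str) -> List[Tuple[str, str]]:
    """Return (entity type, entity text) pairs found by lookup and by rule."""
    entities: List[Tuple[str, str]] = []
    for name in sorted(ORGANIZATIONS):       # lexical lookup
        if name in text:
            entities.append(("organization", name))
    for last, first in PERSON_RULE.findall(text):  # rule-based extraction
        entities.append(("person", f"{first} {last}"))
    return entities


sentence = "Suspect listed as Doe, John contacted the National Science Foundation."
print(extract_entities(sentence))
```

The rule is brittle in exactly the way hand-crafted rules tend to be: a phrase such as "Tucson, Arizona" has the same surface shape and would be mislabeled as a person, which is one motivation for combining several approaches.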


Instead of relying on a single method, most existing information 
extraction systems combine two or more of these approaches. Many sys- 
tems were evaluated at the MUC-7 conference. The best systems were 
able to achieve over 90 percent in both precision and recall rates in 
extracting persons, locations, organizations, dates, times, currencies, 
and percentages from a collection of New York Times news stories. 

Recent years have seen research on named-entity extraction for intel- 
ligence and security applications (Patman & Thompson, 2003; Wang, 
Huang, Teng, & Chien, 2004). For example, Chau, Xu, and Chen (2002) 
developed a neural network-based entity extraction system to identify 
personal names, addresses, narcotic drugs, and personal property names 
from police report narratives. Rather than relying entirely on manual 
rule generation, this system combines lexical lookup, machine learning, 
and some hand-crafted rules. The system achieved over 70 percent precision 
and recall rates for personal names and narcotic drug names. 
However, it was difficult to achieve satisfactory performance for 
addresses and personal property because of their wide variation. Sun, 
Naing, Lim, and Lam (2003) converted the entity extraction problem 
into a classification problem in order to identify relevant entities from 
the MUC text collection on terrorism. They first identified all noun 
phrases in a document and then used the support vector machine to clas- 
sify those entity candidates on the basis of both content and context fea- 
tures. The results showed that for the specific terrorism text collection, 
the performance of this approach with regard to precision and F measure 
was comparable to AutoSlog (Riloff, 1996), one of the best entity extrac- 
tion systems reported earlier. 
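
For readers unfamiliar with the evaluation measures mentioned in this section, the sketch below computes precision, recall, and the F measure for a set of extracted entities against a hand-labeled reference set. The balanced form of the F measure (the harmonic mean of precision and recall) is assumed here, and the entity strings are invented.

```python
from typing import Iterable, Tuple


def precision_recall_f(extracted: Iterable[str],
                       reference: Iterable[str]) -> Tuple[float, float, float]:
    """Precision, recall, and balanced F measure for an extraction run."""
    extracted, reference = set(extracted), set(reference)
    correct = len(extracted & reference)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical system output and hand-labeled reference entities.
system_output = {"John Doe", "Tucson", "cocaine", "Main St"}
hand_labeled = {"John Doe", "Tucson", "cocaine", "Jane Roe", "heroin"}

p, r, f = precision_recall_f(system_output, hand_labeled)
print(round(p, 2), round(r, 2), round(f, 2))
```

Here three of the four extracted entities are correct (precision 0.75) and three of the five reference entities are found (recall 0.60), giving an F measure of about 0.67.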

Several news and event extraction systems have been reported 
recently, such as Columbia’s Newsblaster (McKeown, Barzilay, Chen, 
Elson, Evans, Klavans, et al., 2003) and the Carnegie Mellon 
University (CMU) system (Yang, Carbonell, Brown, Pierce, Archibald, & Liu, 
1999), which automatically extract, categorize, and summarize events 
from international online news sources. Some of these systems can also 
work for multilingual documents and have great potential for automatic 
detection and tracking of terrorism events for intelligence purposes. 


Crime Spatial and Temporal Mining 


Most crimes, including terrorism, have significant spatial and tempo- 
ral characteristics (Brantingham & Brantingham, 1981). Analysis of 
spatial and temporal patterns of crimes continues to be one of the most 
important crime investigation techniques. It aims to gather intelligence 
about environmental factors that prevent or encourage crimes 
(Brantingham & Brantingham, 1981), identify geographic areas of high 
crime concentration (Levine, 2000), and detect criminal trends 
(Schumacher & Leitner, 1999). The discovery of such patterns makes 
possible the use of effective and proactive control strategies, such as allo- 
cating the appropriate amount of policing resources in certain areas at 
certain times, to prevent crimes. 

Spatial pattern analysis and geographical profiling play important 
roles in solving crimes (Rossmo, 1995). Three approaches for crime spa- 
tial pattern mining have been reported: visual approaches, clustering 
approaches, and statistical approaches (Murray, McGuffog, Western, & 
Mullins, 2001). The visual approach is also called crime mapping. It pre- 
sents a city or regional map annotated with various crime-related infor- 
mation. For example, a map can be color-coded to present the densities of 
a specific type of crime in different geographical areas. Such an approach 
can help users visually detect relationships between spatial features and 
the occurrence of crime. The clustering approach has been used in hot 
spot analysis, a process of automatically identifying areas with high 
crime concentration. This type of analysis helps law enforcement effec- 
tively allocate policing resources to reduce crime in hot spot areas. 
Partitional clustering algorithms such as the k-means methods are often 
used for finding hot spots (Murray & Estivill-Castro, 1998). For example, 
Schumacher and Leitner (1999) used the k-means algorithm to identify 
hot spots in the downtown areas of Baltimore. Comparing these for dif- 
ferent years, they found evidence of the displacement of crimes following 
redevelopment of the downtown area. Corresponding proactive strategies 
were then suggested on the basis of the patterns found. Although 
efficient and scalable in comparison to hierarchical clustering algo- 
rithms, partitional clustering algorithms usually require the user to pre- 
define the number of clusters to be found. This, however, is not always 
feasible (Grubesic & Murray, 2001). Accordingly, researchers have tried 
to use statistical approaches to conduct hot spot analysis or to test the 
significance of hot spots (Craglia, Haining, & Wiles, 2000). The test statistics 
G_i (Getis & Ord, 1992; Ord & Getis, 1995) and Moran’s I (Moran, 
1950), which are used to test the significance of spatial autocorrelation, 
can be used to detect hot spots. If a variable is correlated with itself 
through space, it is said to be spatially autocorrelated. For example, 
Ratcliffe and McCullagh (1999) employed G_i and G_i* statistics to identify 
the hot spots of residential burglary and motor vehicle crimes in a 
city. Compared with a domain expert’s perception of the hot spots, this 
approach was shown to be effective (Ratcliffe & McCullagh, 1999). 
Statistical approaches have also been used in crime prediction. Based on 
spatial choice theory (McFadden, 1973), Xue and Brown (2003) modeled 
the probability of a criminal choosing a target location as a function of 
multiple spatial characteristics of the location such as family density per 
unit area and distance to highway. Using regression analysis, they pre- 
dicted the locations of future crimes in a city. Evaluation showed that 
their models significantly outperformed conventional hot spot models. 
Similarly, Brown, Dalton, and Hoyle (2004) built a logistic regression 
model to predict suicide bombing in counter-terrorism applications. 
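
Spatial autocorrelation can be made concrete with Moran's I, which for n areas with values x and spatial weights w is (n / sum of all w_ij) times the sum of w_ij (x_i - mean)(x_j - mean), divided by the sum of (x_i - mean)^2. The sketch below is a minimal implementation under simplifying assumptions: a binary adjacency matrix serves as the weights, the crime counts are invented, and no significance test is performed.

```python
from typing import List, Sequence


def morans_i(values: Sequence[float], weights: List[List[float]]) -> float:
    """Moran's I for area values and a spatial weights matrix (zero diagonal)."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    weight_total = sum(sum(row) for row in weights)
    cross = sum(weights[i][j] * dev[i] * dev[j]
                for i in range(n) for j in range(n))
    variance_sum = sum(d * d for d in dev)  # zero if all values are identical
    return (n / weight_total) * (cross / variance_sum)


# Four hypothetical areas along a street; neighbors share a boundary.
adjacency = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

clustered = [10, 10, 0, 0]    # high counts sit next to high counts
alternating = [10, 0, 10, 0]  # high counts sit next to low counts

print(morans_i(clustered, adjacency))
print(morans_i(alternating, adjacency))
```

The clustered pattern yields a positive value (about 0.33) and the alternating pattern a negative one (-1.0), matching the intuition that hot spots correspond to positive spatial autocorrelation.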

Commercially available geographical information systems (GIS) and 
crime mapping tools, such as ArcView and MapInfo, have been widely 
used in law enforcement and intelligence agencies for analyzing and 
visualizing spatial patterns of crimes. Geographical coordinate information 
as well as various spatial features, such as the distance from the 
location of a crime to major roads and police stations, is often used in 
GIS (Harris, 1990; Weisburd & McEwen, 1997). 

Research on temporal patterns of crimes is relatively scarce in com- 
parison to crime mapping. Two major approaches have been reported, 
namely visualization and statistical modeling approaches. Visualization 
approaches present individual or aggregated temporal features of crimes 
using a periodic or timeline view. Common methods of viewing periodic 
data include sequence charts, point charts, bar charts, line charts, and 
spiral graphs displayed in 2-D or 3-D (Tufte, 1983). In a timeline view, a 
sequence of events is presented based on its temporal order. For example, 
LifeLines provides the visualization of a patient’s medical history using a 
timeline view. The Spatial Temporal Visualizer (STV) (Buetow, Chaboya, 
O’Toole, Cushna, Daspit, Peterson, et al., 2003) seamlessly incorporates 
periodic view, timeline view, and GIS view in the system to support crim- 
inal investigations. Visualization approaches rely on human users to 
interpret data presentations and to find temporal patterns of events. 
Statistical approaches, on the other hand, build statistical models from 
observations to capture the temporal patterns of events. For instance, 
Brown and Oxford (2001) developed several statistical models including 
a log-normal regression model, a Poisson regression model, and cumula- 
tive logistic regression models to predict the number of breaking and 
entering crimes in Richmond, Virginia. The log-normal regression model 
was found to fit the data best. 
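
The data preparation behind a periodic view and a timeline view can be sketched briefly. The code below only aggregates and orders timestamps; it does not draw charts and is not taken from the STV system. The function names and the incident times are hypothetical.

```python
from collections import Counter
from datetime import datetime
from typing import List, Tuple

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def periodic_counts(timestamps: List[datetime]) -> Tuple[Counter, Counter]:
    """Aggregate incidents by hour of day and by day of week (periodic view)."""
    by_hour = Counter(t.hour for t in timestamps)
    by_weekday = Counter(WEEKDAYS[t.weekday()] for t in timestamps)
    return by_hour, by_weekday


def timeline(timestamps: List[datetime]) -> List[datetime]:
    """Order incidents chronologically (timeline view)."""
    return sorted(timestamps)


# Hypothetical incident times, invented for illustration.
incidents = [
    datetime(2005, 1, 15, 23, 55),
    datetime(2005, 1, 12, 23, 15),
    datetime(2005, 1, 15, 2, 5),
    datetime(2005, 1, 14, 23, 40),
]

by_hour, by_weekday = periodic_counts(incidents)
print(by_hour.most_common(1), by_weekday.most_common(1))
```

Counts of this kind are also the natural input to the statistical models mentioned above, for example as the response variable in a Poisson regression.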


Criminal Network Analysis 


Criminals seldom operate in a vacuum but instead interact with one 
another to carry out various illegal activities. Relationships between 
individual offenders form the basis for organized crime and are essential 
for the smooth operation of a criminal enterprise (Cronin, 2005; 
Strickland, 2002a, 2002b, 2002c, 2002d, 2002e). Unlike bureaucratic 
organizations, criminal enterprises often operate in networks consisting 
of nodes (individual offenders) and links (relationships). In criminal net- 
works, there may exist groups or teams, within which members have 
close relationships. One group may also interact with other groups to 
obtain or transfer illicit goods, services, or information. Moreover, indi- 
viduals play different roles in their groups. For example, some key mem- 
bers may act as leaders to control the activities of a group, while others 
may serve as gatekeepers to ensure the smooth flow of information or 
illicit goods (Strickland, 2002a, 2002b, 2002c, 2002d, 2002e). 

Structural network patterns in terms of subgroups, intergroup inter- 
actions, and individual roles thus are important for understanding the 
organization, structure, and operation of criminal enterprises. Such 
knowledge can help law enforcement and intelligence agencies disrupt 
criminal networks and develop effective control strategies to combat 
organized crime (Cronin, 2005). For example, removal of central mem- 
bers in a network may effectively upset the operational network and put 
a criminal enterprise out of action (Baker & Faulkner, 1993; McAndrew, 
1999; Sparrow, 1991). Subgroups and interaction patterns between 
groups are helpful for finding a network’s overall structure, which often 
reveals points of vulnerability (Evan, 1972; Ronfeldt & Arquilla, 2001). 
For a centralized structure such as a star or a wheel, the point of vulnerability 
lies in its central members. A decentralized network such as a 
chain or clique, however, does not have a single point of vulnerability 
and thus may be more difficult to disrupt (Strickland, 2002a, 2002b, 
2002c, 2002d, 2002e). 

Social Network Analysis (SNA) provides a set of measures and 
approaches for structural network analysis (Wasserman & Faust, 1994). 
These techniques were originally designed to discover social structures 
in social networks (Wasserman & Faust, 1994) and are especially appro- 
priate for studying criminal networks (McAndrew, 1999; Sparrow, 1991). 
Studies involving evidence mapping in fraud and conspiracy cases have 
employed SNA measures to identify central members in criminal net- 
works (Baker & Faulkner, 1993; Saether & Canter, 2001). In general, 
SNA is capable of detecting subgroups, identifying central individuals, 
discovering between-group interaction patterns, and uncovering a net- 
work’s overall structure: 


• Subgroup detection. With networks represented in a 
matrix format, the matrix permutation approach and 
cluster analysis have been employed to detect 
underlying groups that are not otherwise apparent in 
data (Wasserman & Faust, 1994). Burt (1976) proposed 
to apply hierarchical clustering methods based on a 
structural equivalence measure (Lorrain & White, 
1971) to partition a social network into positions in 
which members have similar structural roles. Xu and 
Chen (2003) employed hierarchical clustering to detect 
criminal groups in a narcotics network based on the 
relational strength between criminals. 


• Central member identification. Centrality deals with 
the roles of network members. Several measures, such 
as degree, betweenness, and closeness, are related to 
centrality (Freeman, 1979). The degree of a particular 
node is its number of direct links; its betweenness is 
the number of geodesics (i.e., the shortest paths 
between any two nodes) passing through it; and its 
closeness is the sum of the lengths of all the geodesics between the 
particular node and every other node in the network. 
Although these three measures are all intended to 
illustrate the importance or centrality of a node, they 
support interpretation of the roles of network members 
differently. An individual having a high degree 
measurement, for instance, may be inferred to have a 
leadership function, whereas an individual with a high 
level of betweenness may be seen as a gatekeeper in 
the network. Baker and Faulkner employed these 
three measures, especially degree, to find the key indi- 
viduals in a price-fixing conspiracy network in the 
electrical equipment industry (Baker & Faulkner, 
1993). Krebs found that, in the network consisting of 
the September 11 hijackers (19 in all), Mohamed Atta 
scored the highest on degree (Krebs, 2001). 


• Discovery of patterns of interaction. Patterns of 
interaction between subgroups can be discovered using 
an SNA approach called blockmodel analysis (Arabie, 
Boorman, & Levitt, 1978). Given a partitioned net- 
work, blockmodel analysis determines the presence or 
absence of an association between a pair of subgroups 
by comparing the density of the links between them with
a predefined threshold value. In this way,
blockmodeling summarizes individual
interaction details into interactions between groups so 
that the overall structure of the network becomes more 
apparent. 
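
As an illustration of the three centrality measures described above, the following minimal sketch computes degree, betweenness, and closeness on a small undirected toy network. The graph, the function names, and the node labels are hypothetical. Betweenness is computed in Freeman's fractional form (each pair of nodes contributes the share of its geodesics that pass through a node), and closeness is reported as the raw sum of geodesic lengths, so a lower value means a more central node.

```python
from collections import deque


def geodesic_counts(graph, source):
    """Breadth-first search from source.

    Returns (dist, paths): dist[v] is the geodesic length from source to v,
    paths[v] is the number of distinct geodesics from source to v.
    """
    dist = {source: 0}
    paths = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                paths[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:
                paths[v] += paths[u]
    return dist, paths


def centrality(graph):
    """Degree, betweenness, and closeness for an undirected, unweighted graph."""
    nodes = sorted(graph)
    info = {n: geodesic_counts(graph, n) for n in nodes}
    degree = {n: len(graph[n]) for n in nodes}
    # Sum of geodesic lengths to every reachable node (lower = more central).
    closeness = {n: sum(info[n][0].values()) for n in nodes}
    betweenness = {n: 0.0 for n in nodes}
    for i, s in enumerate(nodes):
        dist_s, paths_s = info[s]
        for t in nodes[i + 1:]:
            if t not in dist_s:
                continue  # s and t are not connected
            dist_t, paths_t = info[t]
            for v in nodes:
                if v == s or v == t or v not in dist_s or v not in dist_t:
                    continue
                # v lies on an s-t geodesic when the two legs add up exactly.
                if dist_s[v] + dist_t[v] == dist_s[t]:
                    betweenness[v] += paths_s[v] * paths_t[v] / paths_s[t]
    return degree, betweenness, closeness


# Hypothetical five-member network given as adjacency sets.
graph = {
    "A": {"B"},
    "B": {"A", "C", "D"},
    "C": {"B"},
    "D": {"B", "E"},
    "E": {"D"},
}
degree, betweenness, closeness = centrality(graph)
print(degree, betweenness, closeness)
```

In this toy network, node B has the most direct links and also sits on the most geodesics, while node D has only two links yet is the only route to E, the kind of gatekeeper position that betweenness is meant to expose.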


SNA also includes visualization methods that present networks 
graphically. The Smallest Space Analysis (SSA) approach (Wasserman & 
Faust, 1994), a branch of Multi-Dimensional Scaling (MDS), is used 
extensively in SNA to produce two-dimensional representations of social 
networks. In a graphical portrayal of a network produced by SSA, the 
stronger the association between two nodes or two groups, the closer 
they appear on the graph; the weaker the association, the farther apart 
(McAndrew, 1999). Several network analysis tools, such as Analyst’s 
Notebook (Klerks, 2001), Netmap (Goldberg & Senator, 1998), and 
Watson (Anderson, Arbetter, Benawides, & Longmore-Etheridge, 1994), 
can automatically draw a graphical representation of a criminal net- 
work. However, these tools do not provide much structural analysis func- 
tionality and rely on investigators’ manual examinations to extract 
structural patterns. 

The six classes of KDD techniques reviewed here constitute the key 
components of our proposed ISI research framework. Our focus on the 
KDD methodology, however, does not exclude other approaches. For 
example, studies using simulation and multi-agent models have shown 
promise in the “what-if” analysis of the robustness of terrorist and crim- 
inal networks (Carley, Dombroski, Tsvetovat, Reminga, & Kamneva, 
2003; Carley, Lee, & Krackhardt, 2002). 

In the next section, we present several case studies showing the value 
and potential of different KDD technologies to accomplish the critical 
objectives of national security. 


ISI in Critical Mission Areas: Case Studies


In response to the challenges of national security, the COPLINK 
Center at the University of Arizona has developed several research pro- 
jects to address five of the six critical mission areas identified in the 
National Strategy for Homeland Security report (U.S. Office of 
Homeland Security, 2002): intelligence and warning, border and trans- 
portation security, domestic counter-terrorism, protecting critical infra- 
structure and key assets, and emergency preparedness and response. 
The center’s main goal is to develop information and knowledge man- 
agement technologies appropriate for capturing, accessing, analyzing, 
visualizing, and sharing law enforcement and intelligence-related infor- 
mation (Chen, Zeng, Atabakhsh, Wyzga, & Schroeder, 2003). Through 
the following eight case studies, we demonstrate how critical mission 
issues could be addressed using the knowledge discovery approach. For 
each case study, we discuss its relevance to national security missions, 
data characteristics, technology used, and selected evaluation results. 
Quantitative studies focused primarily on the performance of the tech- 
niques in terms of effectiveness, accuracy, efficiency, usefulness, and so 
forth. In qualitative studies where quantitative results are not yet avail- 
able, we summarize and report comments and feedback from our domain 
experts. 


Intelligence and Warning 


Although terrorism depends on surprise (U.S. Office of Homeland 
Security, 2002), terrorist attacks are not random but require careful 
planning, preparation, and cooperation before execution. To avoid being 
preempted by authorities, terrorists may disguise their true identities or 
hide their illegal objectives and intents behind legal activities. Similarly, 
criminals may try to minimize the possibility of being identified and cap- 
tured by using falsified identities. To detect hidden intent and potential 
for future attacks or offenses is the main goal of intelligence and warn- 
ing systems. In this section, we present two case studies addressing 
intelligence and warning needs. The first helped to detect deceptive 
identity records in police data (Wang, Chen, et al., 2004), while in the 
second, we present our design for an intelligence Web portal to help 
trace and monitor the Web sites of terrorist organizations (Chen, Qin, 
Reid, Chung, Zhou, Xi, et al., in press; Reid, Qin, Chung, Xu, Zhou, 
Schumaker, 2004). 


Case Study 1: Detecting Deceptive Criminal Identities 


It is common practice for criminals to lie about the particulars of their 
identities, such as name, date of birth, address, and social security num- 
ber, in order to deceive police investigators. Inability to validate identity 
can be used as a warning mechanism because the deception signals an 
intent to commit future offenses. In this case study, we focus on uncov- 
ering patterns of criminal identity deception based on actual criminal 
records and suggest an algorithmic approach to revealing false identities 
(Wang, Chen, et al., 2004). 

Data used in this study were authoritative criminal identity records 
obtained from the Tucson Police Department (TPD). These records were 
structured database entries containing criminal identity information, 
such as name, date of birth (DOB), address, identification number (e.g., 
social security number), race, weight, and height. The total number of 
criminal identity records stored in the TPD databases was over 1.5 mil- 
lion. In order to study the patterns of criminal identity deception, we 
selected from the TPD database 372 records involving 24 criminals, each 
having one real identity record and several deceptive records. These sets 
of deceptive records were not randomly sampled from the database, but 
were manually extracted by a police detective expert who has served in 
law enforcement for 30 years. The expert used convenience sampling, in 
which he reviewed the list of all identity records and chose the deceptive 
identity records that he encountered. Because deceptive identities are 
sparsely distributed in the criminal database, convenience sampling is 
more effective than random sampling for experimental purposes. Because 
the sample was not random, however, the conclusions may not be 
statistically generalizable. 

We carefully examined these 372 records and found that deception 
occurred most often in specific attributes: name, address, birth date, and 
Social Security Number (SSN). The identity deception patterns in this 
dataset are shown in Figure 6.2. Name deception, occurring in most 
cases, includes giving a false first name and a true last name or vice 
versa, changing the middle initial, and giving a name pronounced simi- 
larly but spelled differently. Deception on DOB can consist of, for exam- 
ple, switching places between the month of birth and the day of birth. 
Similarly, ID deception is often made by changing a few digits of an SSN 
or by switching their places. In residency deception, criminals usually 
change only one portion of the address. For example, we found that, in 
about 87 percent of cases, criminals provided a false street number along 
with the true street direction, street name, or street type. 

To detect deceptive identity records automatically, we employed a 
similarity-based association mining method to extract associated (simi- 
lar) record pairs. Based on the deception patterns found, we selected 
four attributes (name, DOB, SSN, and address) for our analysis. We 
compared and calculated the similarity between the values of corre- 
sponding attributes of each pair of records. If two records were signifi- 
cantly similar, we assumed that at least one of them was deceptive. 

Figure 6.2 Identity deception patterns (each percentage number represents the 
proportion of records that contain the particular type of deception in 
the selected dataset). 

Because the four selected attributes primarily have string values, we 
compared two attribute values based on their edit distance 
(Levenshtein, 1966) and Soundex code (Newcombe et al., 1959). The edit 
distance between two strings is the minimum number of single charac- 
ter insertions, deletions, and substitutions required to transform one 
string into the other. Soundex code represents the phonetic pattern of a 
string. For example, “PEARSE” and “PIERCE” are both coded as “P620.” 
To detect both spelling and phonetic variations between two name 
strings, edit distance similarity and Soundex similarity were computed 
separately. In order to capture name exchange deception, similarities 
were also computed based on different sequences of first name and last 
name. We took the similarity value from the sequence that had the max- 
imal value between two names. We used only edit distance to compare 
non-phonetic attributes of DOB, SSN, and address. Each similarity 
value was normalized between 0 and 1. The similarity value over all four 
attributes was calculated by means of a normalized Euclidean distance 
function. 
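
The comparison steps described above can be sketched as follows. This is a minimal illustration rather than the implementation used in the study: the function names are hypothetical, the Soundex variant is the common American Soundex, and three details that the text leaves open are filled with labeled assumptions (Soundex similarity is taken as the edit similarity of the two codes, the two name-order scores are averaged over first and last name, and the overall similarity is one minus the root-mean-square of the per-attribute dissimilarities).

```python
import math


def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def edit_similarity(a, b):
    """Edit distance normalized to a similarity between 0 and 1."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))


SOUNDEX_CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        SOUNDEX_CODES[ch] = digit


def soundex(name):
    """American Soundex: first letter plus three digits, zero padded."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0]
    last = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = SOUNDEX_CODES.get(ch, "")
        if digit and digit != last:
            code += digit
        if ch not in "HW":  # H and W do not separate repeated codes
            last = digit
    return (code + "000")[:4]


def name_similarity(first_a, last_a, first_b, last_b):
    """Best score over both name orders (assumption: average of the two parts)."""
    def part(x, y):
        spelling = edit_similarity(x.upper(), y.upper())
        phonetic = edit_similarity(soundex(x), soundex(y))  # assumption
        return max(spelling, phonetic)
    straight = (part(first_a, first_b) + part(last_a, last_b)) / 2
    swapped = (part(first_a, last_b) + part(last_a, first_b)) / 2
    return max(straight, swapped)


def record_similarity(attribute_sims):
    """Combine per-attribute similarities via a normalized Euclidean distance."""
    distance = math.sqrt(sum((1.0 - s) ** 2 for s in attribute_sims)
                         / len(attribute_sims))
    return 1.0 - distance


# Hypothetical pair of records: swapped name order, day and month exchanged.
sims = [
    name_similarity("John", "Smith", "Smith", "John"),
    edit_similarity("03/07/1975", "07/03/1975"),
    edit_similarity("123456789", "123456798"),
    edit_similarity("100 E MAIN ST", "180 E MAIN ST"),
]
print(soundex("PEARSE"), soundex("PIERCE"), round(record_similarity(sims), 3))
```

Taking the maximum over both name orders means that a record with first and last name exchanged still scores as a perfect name match, which is the behavior the name-exchange pattern calls for.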

In order to test the performance of our approach, we used convenience 
sampling again to select another set of 120 records. However, in this 
case, we chose only records with complete information in the name, 
address, DOB, and SSN fields. The 120 records involved 44 criminals, 
each of whom had an average of three records in the sample set. Some 
data were used to train and test our algorithm so that records pointing 
to the same suspect could be associated with each other. Training and 
testing were validated by a standard hold-out sampling method. Of the 
120 records in the testbed, 80 (66.7 percent) were used for training the 
algorithm, and the remaining 40 were used for testing. 

A similarity matrix was built for all training records. Similarity val- 
ues in the matrix were used to establish the threshold values appropri- 
ate to distinguish between similar and dissimilar pairs. Accuracy rates 
for correctly recognized similar pairs of records using different threshold 
values are shown in Table 6.3. When the threshold similarity value was 
set to 0.52, our algorithm achieved its highest accuracy of 97.4 percent, 
with relatively small false negative and false positive rates; both were 
2.6 percent. 


Table 6.3 Accuracy comparison based on different threshold values 


Threshold   Accuracy   False Negative*   False Positive**
0.6         76.60%     23.40%            0.00%
0.55        92.20%     7.80%             0.00%
0.54        93.50%     6.50%             2.60%
0.53        96.10%     3.90%             2.60%
0.52        97.40%     2.60%             2.60%
0.51        97.40%     2.60%             6.50%
0.5         97.40%     2.60%             11.70%


*False negative: similar records considered dissimilar 
**False positive: dissimilar records considered similar 

A similarity matrix was also built for the 40 test records. By applica- 
tion of the optimal threshold value to the testing similarity matrix, 
records having a similarity value of more than 0.52 were considered to 
be pointing to the same offender. The accuracy of association in the test- 
ing data set is shown in Table 6.4. The result shows that the algorithm 
is effective (with an accuracy level of 94 percent) in linking deceptive 
records pointing to the same offender. 
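
The threshold selection and testing procedure can be illustrated with a short sketch. The scored pairs below are invented toy data, the function and key names are hypothetical, and the denominators are assumptions, since the text does not define them: accuracy and the missed-link rate are taken over truly similar pairs, and the false-link rate over truly dissimilar pairs.

```python
def evaluate_threshold(scored_pairs, threshold):
    """Rates obtained when pairs scoring above the threshold are linked.

    scored_pairs: list of (similarity, same_person) tuples, where same_person
    is the ground-truth label supplied by an expert.
    """
    similar = [s for s, same in scored_pairs if same]
    dissimilar = [s for s, same in scored_pairs if not same]
    linked_similar = sum(s > threshold for s in similar)
    linked_dissimilar = sum(s > threshold for s in dissimilar)
    return {
        "accuracy": linked_similar / len(similar),
        "missed_links": 1 - linked_similar / len(similar),
        "false_links": linked_dissimilar / len(dissimilar),
    }


# Hypothetical training pairs: (similarity value, truly the same person?).
training_pairs = [
    (0.90, True), (0.60, True), (0.50, True),
    (0.55, False), (0.30, False), (0.20, False), (0.10, False),
]

# Sweep candidate thresholds on the training pairs, as in a hold-out design,
# and inspect the trade-off before fixing one value for the test set.
for candidate in (0.60, 0.55, 0.52, 0.50, 0.45):
    print(candidate, evaluate_threshold(training_pairs, candidate))
```

Lowering the threshold recovers more true links at the cost of more false ones, which is why the value is fixed on the training records and only then applied to the held-out test records.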

Although the case study produced promising results, much more 
research is needed for deception detection, which we believe is a unique 
and critical problem for ISI. 


Table 6.4 The accuracy of association in the testing data set 


Threshold   Accuracy   False Negative   False Positive
0.52        94.0%      6.0%             0.0%


Case Study 2: The “Dark Web” Portal 


Because the Internet has become a global platform for information 
dissemination and communication, terrorists also take advantage of the 
freedom of cyberspace and construct their own Web sites to propagate 
terrorist ideology, share information, and recruit new members. Web 
sites of terrorist organizations may also connect to one another through 
hyperlinks, forming a “dark Web.” We are building an intelligent Web 
portal, called the Dark Web Portal, to help terrorism researchers collect, 
access, analyze, and understand terrorist groups (Chen, Qin, et al., in 
press; Reid et al., 2004). This project consists of three major components: 
Dark Web testbed building, Dark Web link analysis, and Dark Web 
Portal building. 


• Dark Web Testbed Building. Drawing on reliable 
governmental sources such as the Anti-Defamation 
League (ADL), FBI, and United States Committee for 
a Free Lebanon (USCFL), we identified 224 U.S. 
domestic terrorist groups and 440 international 
terrorist groups. For U.S. domestic groups, group- 
generated URLs can be found in FBI reports and the 
Google Directory. For international groups, we used 
the group names as queries to search major search 
engines such as Google and manually identified the 
group-created URLs from the result lists. To ensure 
that our testbed covered all the major regions in the 
world, we sought the assistance of language experts in 
English, Arabic, Spanish, and Japanese to help us col- 
lect URLs in different regions. All URLs collected were 
manually checked by experts to make sure that they 
were created by terrorist groups. Once a group’s URL 
was identified, we used the SpidersRUs toolkit, a mul- 
tilingual Digital Library building tool developed by our 
own group, to collect all the Web pages under that 
URL and store them in our testbed. We have collected 
500,000 Web pages created by U.S. domestic groups, 
400,000 Web pages created by Arabic-speaking groups, 
100,000 Web pages created by Spanish-speaking 
groups, and 2,200 Web pages created by Japanese- 
speaking groups. This testbed is updated bimonthly. 


• Dark Web Link Analysis and Visualization. Terrorist 
groups are not atomized individuals but actors linked 
to each other through complex networks of direct or 
mediated exchanges. Identifying how relationships 
between groups are formed and dissolved in the 
terrorist group network would enable us to reveal the 
social milieux and communication channels among 
terrorist groups across different jurisdictions. Previous 
studies have shown that the link structure of the Web 
represents a considerable amount of latent human 
annotation (Gibson, Kleinberg, & Raghavan, 1998). 
Thus, by analyzing and visualizing hyperlink 
structures between terrorist-generated Web sites and 
their content, we could discover the structure and 
organization of terrorist group networks, capture net- 
work dynamics, and understand their emerging 
activities (e.g., exploiting formal or informal banking 
systems, changing identities to take on characteristics 
more identifiable with Western societies, or creating 
their own online communities). To test our ideas, we 
conducted an experiment in which we analyzed and 
visualized the hyperlink structure between 
approximately 100,000 Web pages from 46 Web sites in 
our current testbed. These 46 Web sites were created 
by four major Arabic-speaking terrorist groups, namely 
Al-Gama’a al-Islamiyya (Islamic Group, IG), Hizballah 
(Party of God), Al-Jihad (Egyptian Islamic Jihad), and 
Palestinian Islamic Jihad (PIJ) and their supporters. 
Hyperlinks between each pair of the 46 Web sites were 
extracted from the Web pages and a closeness value 
was calculated for each pair of the 46 Web sites as 
shown in Figure 6.3. Each node represents a Web site 
created by one of the 46 groups. A link existing 
between two nodes means there are hyperlinks 
between the Web pages of the two sites. We presented 
this network to several domain experts and confirmed 
that the structure of the diagram matched the experts’ 
knowledge of how the groups related to each other in 
the real world. The four clusters represent a logical 
mapping of the existing relations among the 46 groups. 
For instance, the Palestinian terrorist group’s cluster 
includes many of these groups’ Web sites, as well as 
their leaders’ sites. Examples include the Al-Aqsa 
Martyrs’ Brigade (http://www.katae.baqsa.org), 
HAMAS (http://www.ezzedeen.net), and PIJ 
(http://www.abrarway.com). 


• Dark Web Portal Building. Using the Dark Web Portal, 
experts are able to locate specific dark Web 
information in the testbed quickly through keyword 
search. To address the information overload problem, 
the Dark Web Portal is designed with post-retrieval 
components. A modified version of a text summarizer 
called TXTRACTOR, which uses sentence-selection 
heuristics to rank and select important text segments 
(McDonald & Chen, 2002), has been incorporated into 
the Dark Web Portal. The summarizer can flexibly 
summarize Web pages so that experts can quickly get 
the main idea of a page without having to read 
through it. A categorizer organizes the search results 
into various folders labeled with the key phrases 
extracted by the Arizona Noun Phraser (AZNP) (Tolle 
& Chen, 2000) from the page summaries or titles, 
thereby facilitating the understanding of different 
groups of Web pages. A visualizer clusters Web pages 
into colored regions using the SOM algorithm 
(Kohonen, 1995), thus reducing information overload 
when a large number of search results is obtained. 
Post-retrieval analysis could further reduce 
information overload, but researchers are limited to 
data in their native languages and cannot fully utilize 
the multilingual information in the testbed. To address 
this problem, we have added a cross-lingual 
information retrieval (CLIR) component into the 
portal. On the basis of our previous research, we have 
developed a dictionary-based CLIR system for use in 
the Dark Web Portal. It currently accepts English 
queries and retrieves documents in English, Spanish, 
Chinese, Japanese, and Arabic. A machine translation 
(MT) component will be added to the Dark Web Portal 
to translate the multilingual information retrieved by 
the CLIR component back into the experts’ native 
languages. 
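
The hyperlink extraction step in the link analysis component can be sketched as an aggregation from page-level links to site-level pairs. This is only an illustration: the URLs are placeholders, the function names are hypothetical, and a plain count of links in either direction stands in for the closeness value, whose formula is not given here.

```python
from collections import Counter
from urllib.parse import urlparse


def site_of(url):
    """Reduce a page URL to its host name, ignoring a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def site_link_counts(page_links, known_sites):
    """Count hyperlinks between each pair of distinct sites in the testbed.

    page_links: iterable of (source_page_url, target_url) tuples.
    known_sites: set of host names that belong to the testbed.
    """
    counts = Counter()
    for source, target in page_links:
        a, b = site_of(source), site_of(target)
        if a != b and a in known_sites and b in known_sites:
            counts[frozenset((a, b))] += 1  # direction is ignored
    return counts


# Placeholder data; real input would come from the crawled pages.
known_sites = {"site-a.example", "site-b.example", "site-c.example"}
page_links = [
    ("http://www.site-a.example/p1", "http://site-b.example/x"),
    ("http://site-b.example/y", "http://www.site-a.example/"),
    ("http://site-a.example/p2", "http://site-a.example/p3"),  # internal link
    ("http://site-a.example/p2", "http://other.example/"),     # outside testbed
]
counts = site_link_counts(page_links, known_sites)
for pair, n in counts.items():
    print(sorted(pair), n)
```

The resulting pair counts form a weighted site-level network that can be handed to a layout or clustering routine for visualization.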


Because terrorist groups continue to use the Internet as a communi- 
cation, recruiting, and propaganda tool, a systematic and system-aided 
approach to studying their presence on the Web is critically needed. 


Border and Transportation Security 


Terrorists enter a targeted country by air, land, or sea. The govern- 
ment can improve its counter-terrorism and crime-fighting capabilities 
by creating a “smart border,” where information from borders, customs, 
transportation, and local law enforcement agencies is integrated and 
analyzed to help locate wanted terrorists or criminals. Our “BorderSafe” 
project for cross-jurisdictional information integration and sharing 
(Marshall, Kaza, Xu, Atabakhsh, Petersen, Violette, et al., in press) 
illustrates how a smart and safe border might be created. 

Figure 6.3 Web site structural relationships between 46 terrorist organizations or 
affiliated groups. 


Case Study 3: Enhancing BorderSafe 


The BorderSafe project is a collaborative research effort involving the 
University of Arizona’s Artificial Intelligence Lab; several law enforce- 
ment agencies including the Tucson Police Department (TPD), Phoenix 
Police Department (PPD), Pima County Sheriff’s Office (PCSO), and 
Tucson Customs and Border Protection (CBP); the San Diego ARJIS 
(Automated Regional Justice Information System, a regional consortium of more 
than 50 public safety agencies); the San Diego Supercomputer Center 
(SDSC); and the Corporation for National Research Initiatives (CNRI). 

In this study our objective was to integrate structured, authoritative 
data from TPD, PCSO, and a limited dataset from CBP containing 
license plate data of border crossing vehicles. Tables 6.5 and 6.6 present 
the statistics from the three datasets. TPD’s and PCSO’s jurisdictions 
represent a shared community of citizens in Tucson and southern 
Arizona. They also share intertwined communities of criminals. We 
found a substantial amount of data overlap among these datasets. 
Around seven percent of vehicles involved in gang-related, violent, and 
narcotics crimes were registered outside of Arizona. More than 483,000 
people appeared in both the TPD and PCSO datasets, representing 36 
percent of the TPD records and 37 percent of the PCSO records. These 
statistics strongly suggest that sharing information across jurisdictions 
could help catch criminals. 

The federation approach to data integration was employed. We 
adopted the COPLINK schema as the global schema and developed a 
transformation mechanism to reconcile the database structure and 
semantics from a particular database into the global schema. Data were 
then mapped or transformed to allow shared query processing. In our 
datasets, establishing automated transformation procedures for legacy 
PCSO and TPD records into COPLINK format resolved most of the 
structural and semantic difference issues. 


Table 6.5 Statistics regarding the TPD and PCSO datasets 


                               TPD            PCSO
Number of recorded incidents   2.84 million   2.18 million
Number of persons              1.35 million   1.31 million
Number of vehicles             62,656         520,539


Table 6.6 CBP border crossing dataset 


Number of records                   1,125,155
Number of distinct vehicles         226,207
Number of plates issued in AZ       130,195
Number of plates issued in CA       5,546
Number of plates issued in Mexico   90,466

At the instance level, each dataset had a unique key assigned to each 
person or vehicle, but these unique keys did not match across datasets. 
To address this problem, vehicles were matched between datasets on the 
basis of their license plate numbers. We based people matching on input 
from domain experts and assumed that all records with the same first 
name, last name, and DOB represented the same person. These heuris- 
tics were not perfect; a few incorrect matches resulted and certainly 
many correct matches may have been missed. We plan to employ our 
new identity deception detection approach (Wang, Chen, et al., 2004) in 
the future to improve instance-level matching. 
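
The instance-level matching heuristics described above amount to joining two datasets on a derived key. A minimal sketch follows, with hypothetical field names and invented records. The whitespace and case normalization is an added assumption, not something stated for the study.

```python
from collections import defaultdict


def person_key(record):
    """Same first name, last name, and DOB are assumed to mean the same person."""
    return (record["first_name"].strip().upper(),
            record["last_name"].strip().upper(),
            record["dob"])


def plate_key(record):
    """Vehicles are matched on license plate number (normalization assumed)."""
    return record["plate"].replace(" ", "").replace("-", "").upper()


def match_across(records_a, records_b, key_fn):
    """Return (id_a, id_b) pairs whose derived keys are identical."""
    index = defaultdict(list)
    for rec in records_b:
        index[key_fn(rec)].append(rec["id"])
    matches = []
    for rec in records_a:
        for other_id in index.get(key_fn(rec), []):
            matches.append((rec["id"], other_id))
    return matches


# Invented records standing in for two agencies' person tables.
agency_a = [
    {"id": "A1", "first_name": "Ann", "last_name": "Lee", "dob": "1970-01-02"},
    {"id": "A2", "first_name": "Bob", "last_name": "Ray", "dob": "1980-05-05"},
]
agency_b = [
    {"id": "B9", "first_name": " ann", "last_name": "LEE ", "dob": "1970-01-02"},
    {"id": "B8", "first_name": "Bob", "last_name": "Ray", "dob": "1981-05-05"},
]
person_matches = match_across(agency_a, agency_b, person_key)
print(person_matches)
```

Exact-key matching of this kind misses records that differ by a single digit or letter, such as the second pair above, which is the gap that similarity-based identity matching is intended to close.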

We generated and visualized several criminal networks based on inte- 
grated data. We extracted associations between a set of criminals and 
vehicles from crime incident records. A link was created when two or 
more criminals or vehicles were listed in the same incident record. In 
network visualization we differentiated entity types by shape, key 
attributes by node color, level of activity (measured by number of crimes 
committed) by node size, data source by link color, and some details in 
link text or roll-over tool tips. Figure 6.4 shows a network connecting a 
known narcotics dealer to a border crossing plate. 
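
The link creation rule, one link whenever two entities are listed in the same incident record, can be sketched as below. The incident data and entity identifiers are invented, and counting repeated co-occurrences as a link weight is an added assumption.

```python
from collections import Counter
from itertools import combinations


def build_links(incidents):
    """Create a link for every pair of entities listed in the same incident.

    incidents: mapping of incident id -> list of entity ids (people, vehicles).
    Returns a Counter mapping (entity_a, entity_b) to the number of shared
    incidents, with each pair stored in sorted order.
    """
    weights = Counter()
    for entities in incidents.values():
        for a, b in combinations(sorted(set(entities)), 2):
            weights[(a, b)] += 1
    return weights


# Invented incident records mixing persons and a vehicle plate.
incidents = {
    "incident-1": ["person:P1", "person:P2", "vehicle:ABC123"],
    "incident-2": ["person:P1", "person:P2"],
    "incident-3": ["person:P3"],
}
links = build_links(incidents)
for pair, weight in sorted(links.items()):
    print(pair, weight)
```

Prefixing each identifier with its entity type keeps persons and vehicles distinguishable in the resulting network, so a display layer can assign shapes by type.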

A qualitative field study provided positive feedback regarding the 
potential of our data integration approach. Currently, the crime analysts 
from both TPD and PCSO are using the triangulated, integrated crimi- 
nal networks generated by our system to monitor vehicles and criminals 
crossing the border. 

Figure 6.4 A sample criminal network based on integrated data from multiple 
sources. Nodes and links are color coded in the actual system. 


Domestic Counter-Terrorism 


As terrorists may be involved in local crimes, state and local law 
enforcement agencies contribute to national security by investigating 
and prosecuting these crimes. Terrorism, like gangs and narcotics traf- 
ficking, is treated as a type of organized crime in which multiple offend- 
ers cooperate to carry out criminal activities. Information technologies 
that aid in the discovery of cooperative relationships among criminals 
and reveal the patterns of their interaction would also be helpful in ana- 
lyzing terrorism. Through three case studies in this section, we show 
how criminal association information can be extracted from large vol- 
umes of data (Hauck et al., 2002) and how structural patterns in crimi- 
nal or terrorist organizations can be discovered (Xu & Chen, 2003, in 
press). 


Case Study 4: COPLINK Detect 


Crime analysts and detectives search for criminal associations to 
develop investigative leads. However, because association information is 
not directly available in most existing law enforcement and intelligence 
databases and because manual searching is extremely time-consuming, 
automatic identification of relationships among criminal entities 
may significantly speed up investigations. COPLINK Detect is a 
link analysis system that automatically extracts relationship informa- 
tion from large volumes of crime incident data (Hauck et al., 2002). 

Our data were structured crime incident records stored in TPD data- 
bases. The TPD’s current record management system (RMS) consists of 
more than 1.5 million crime incident records that contain details of crim- 
inal events spanning from 1986 to 2004. Although investigators can 
access the RMS to tie information together, they must manually search 
the RMS for connections or existing relationships. 

We used the concept space approach (Chen & Lynch, 1992) to identify 
relationships between entities of interest. Concept space analysis is a 
type of co-occurrence analysis used in information retrieval. The result- 
ing network-like concept space holds all possible associations between 
terms—that is, the system retains and ranks every existing link 
between every pair of concepts. In COPLINK Detect, detailed incident 
records serve as the underlying space, and concepts are derived from the 
meaningful terms that occur in each incident. Concept space analysis 
easily identifies relevant terms and their degree of relationship to the 
search term. The system output includes relevant terms ranked in the 
order of their degree of association, thereby distinguishing the most rel- 
evant terms from inconsequential ones. From a crime investigation 
standpoint, concept space analysis can help investigators link known 
entities to other related entities that might contain useful information 
for further investigation, such as people and vehicles related to a given 
suspect. It is considered an example of entity association mining (Lin & 
Brown, 2003). 
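
To make the idea of ranked associations concrete, the sketch below scores co-occurring terms with a simple conditional frequency: the share of a term's incidents in which another term also appears. This is a deliberately simplified stand-in and not the concept space weighting of Chen and Lynch (1992). The incident data and function names are invented.

```python
from collections import Counter, defaultdict


def association_weights(incidents):
    """Asymmetric weight from a to b: incidents with both / incidents with a."""
    occurs = Counter()
    together = defaultdict(Counter)
    for terms in incidents:
        unique = set(terms)
        occurs.update(unique)
        for a in unique:
            for b in unique:
                if a != b:
                    together[a][b] += 1
    return {a: {b: n / occurs[a] for b, n in neighbours.items()}
            for a, neighbours in together.items()}


def related_terms(weights, term, top_n=5):
    """Terms associated with the search term, strongest first."""
    ranked = sorted(weights.get(term, {}).items(),
                    key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


# Invented incidents, each listing the entities it mentions.
incidents = [
    ["suspect_A", "vehicle_1", "addr_X"],
    ["suspect_A", "vehicle_1"],
    ["suspect_A", "suspect_B"],
    ["suspect_B", "addr_X"],
]
weights = association_weights(incidents)
print(related_terms(weights, "suspect_A"))
```

The weights are asymmetric by construction: in the toy data every incident involving vehicle_1 also involves suspect_A, but the reverse does not hold, so the two directions receive different scores.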

Information related to a suspect can move an investigation in the 
right direction, but revealing relationships among data in one particular 
incident might fail to capture other relationships from the entire data- 
base. In effect, investigators need to review all incident reports related 
to a suspect and this can be tedious work. The COPLINK Detect system 
introduces concept space as an alternative method that captures the 
relationships between four types of entities (person, organization, loca- 
tion, and vehicle) across the entire database. COPLINK Detect also 
offers an easy-to-use interface and allows searching for relationships 
among the four types of entities. Figure 6.5 presents the COPLINK 
Detect interface, showing sample search results for vehicles, relations, 
and crime case details (Hauck et al., 2002). 

We conducted user studies to evaluate the performance and useful- 
ness of COPLINK Detect. Eleven crime analysts and one homicide detec- 
tive from TPD participated in the longitudinal field study over a 
four-week period. Crime analysts were experienced in investigating 
high-profile cases as well as creating statistical reports on criminal 
activities. They were accustomed to link analysis and are the target user 
group of COPLINK Detect. Although detectives were not specialized in 
crime analysis in general, the participating homicide detective was expe- 
rienced in searching for criminal associations using record management 
systems. In this study, three major areas were identified where 
COPLINK Detect provided improved support for crime investigation: 
link analysis, interface design, and operating efficiency. 

Figure 6.5 COPLINK Detect interface showing sample search results. 


Case Study 5: Criminal Network Mining 


Because organized crime is carried out by networked offenders, inves- 
tigation naturally depends on network analysis approaches. Grounded 
in social network analysis methodology, our criminal network-structure 
mining research aims at helping intelligence and security agencies 
extract valuable knowledge regarding criminal or terrorist organiza- 
tions by identifying the central members, subgroups, and overall net- 
work structure (Xu & Chen, 2003, in press). 

Two datasets from TPD were used in the study. (1) A gang network: 
The list of gang members consisted of 16 offenders who had been under 
investigation during the first quarter of 2002. These gang members had 
been involved in 72 crime incidents of various types (e.g., theft, burglary, 
aggravated assault, drug offenses) since 1985. We used the concept 
space approach and generated links between criminals who had com- 
mitted crimes together, ending with a network of 164 members. (2) A 
narcotics network: The list for the narcotics network consisted of 71 
criminal names. A sergeant from the Gang Unit had been studying the 
activities of these criminals since 1995. Because most of them had com- 
mitted crimes related to methamphetamines, the sergeant called this 
network the “Meth World.” These offenders had been involved in 1,206 
incidents since 1983. A network of 744 members was generated. 

We made use of SNA approaches to extract structural patterns in the 
criminal networks: 


• Network partition. We employed hierarchical 
clustering, namely the complete-link algorithm, to 
partition a network into subgroups based on relational 
strength. Clusters obtained represent subgroups. To 
employ the algorithm, we first transformed 
co-occurrence weights generated in the previous 
phase into distances/dissimilarities. The distance 
between two clusters was defined as the distance 
between the pair of nodes drawn from each cluster 
that was farthest apart. The algorithm worked by 
merging the two nearest clusters into one cluster at 
each step and eventually formed a cluster hierarchy. 
The resulting cluster hierarchy specified groupings of 
network members at different granularity levels. At 
lower levels of the hierarchy, clusters (subgroups) 
tended to be smaller and group members were more 
closely related. At higher levels of the hierarchy, subgroups 
tended to be larger and group members more loosely 
related. 


• Centrality measures. We used all three centrality 
measures to identify central members in a given sub- 
group. The degree of a node could be obtained by 
counting the total number of links it had to all the 
other group members. A node’s score of betweenness 
and closeness required the computation of shortest 
paths (geodesics) using Dijkstra’s (1959) algorithm. 


• Blockmodeling. At a given level of a cluster hierarchy, 
we compared intergroup link densities with the 
network’s overall link density to determine the 
presence or absence of intergroup relationships. 


• Visualization. To map a criminal network onto a 
two-dimensional display, we employed multi- 
dimensional scaling (MDS) to generate x-y coordinates 
for each member in a network. We chose Torgerson’s 
(1952) classical metric MDS algorithm because 


(a) Left A 57-member cnminal 
network Each node is labeled using 
the name of the criminal it represents 
Lines represent the relationships 
between cnminals 


(b) Right The reduced structure of the 
network. Each circle represents one 
subgroup labeled by its leader's name. 
The size of the circle 1s proportional to 
the number of cnminals in the group. A 
line represents a relationship between 
two groups. The thickness represents 
the strength of the relah onship 


(c) Right: The inner structure of the biggest group (the 
relationships between group members). Centrality 
rankings of members in this group are listed in a table 
at the nght-hand side 


Figure 6.6 An SNA-based system for criminal network analysis and visualization. 
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distances transformed from co-occurrence weights 
were quantitative data. 
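
The complete-link clustering step in the list above can be illustrated with a minimal sketch. It is not the prototype's code: the weight-to-distance transform (1 / weight), the function names, and the five-member example network are all assumptions made for illustration, and pairs that never co-occurred are simply treated as infinitely far apart.

```python
from itertools import combinations


def weights_to_distances(weights):
    """Assumed transform: distance = 1 / co-occurrence weight."""
    return {frozenset(pair): 1.0 / w for pair, w in weights.items() if w > 0}


def complete_link(nodes, dist):
    """Return the merge history (left, right, distance) of the hierarchy."""
    clusters = [frozenset([n]) for n in nodes]
    history = []

    def gap(a, b):
        # Complete link: the farthest pair drawn from the two clusters.
        # Pairs that never co-occurred are treated as infinitely far apart.
        return max(dist.get(frozenset((x, y)), float("inf")) for x in a for y in b)

    while len(clusters) > 1:
        # Merge the two nearest clusters at each step.
        a, b = min(combinations(clusters, 2), key=lambda pair: gap(*pair))
        history.append((set(a), set(b), gap(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return history


# Hypothetical co-occurrence counts between five network members.
weights = {("A", "B"): 4, ("A", "C"): 2, ("B", "C"): 2, ("C", "D"): 1, ("D", "E"): 5}
history = complete_link("ABCDE", weights_to_distances(weights))
for left, right, d in history:
    print(sorted(left), sorted(right), d)
```

Early entries in the merge history correspond to the small, tightly related subgroups at the bottom of the hierarchy; later entries correspond to the larger, looser groupings near the top.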


A graphical user interface was provided to visualize criminal net- 
works. Figure 6.6 shows the screenshot of our prototype system. In this 
example, each node was labeled with the name of the criminal it repre- 
sented. Criminal names were scrubbed for data confidentiality. A 
straight line connecting two nodes indicated that two corresponding 
criminals committed crimes together and thus were related. To find sub- 
groups and interaction patterns between groups, a user could adjust the 
“level of abstraction” slider at the bottom of the panel. A high level of 
abstraction corresponded with a high distance level in the cluster hier- 
archy. Group members’ rankings in centrality are listed in a table. 
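
Once the slider fixes a level of the hierarchy, the blockmodeling comparison described earlier reduces to a density test between each pair of groups. The following is a rough sketch under stated assumptions: links are unweighted, an intergroup relationship is declared present when the intergroup density exceeds the overall density, and the group labels and links are hypothetical.

```python
from itertools import combinations


def overall_density(nodes, links):
    """Fraction of all possible node pairs that are actually linked."""
    possible = len(nodes) * (len(nodes) - 1) / 2
    return len(links) / possible


def intergroup_density(group_a, group_b, links):
    """Fraction of possible cross-group pairs that are actually linked."""
    present = sum(frozenset((a, b)) in links for a in group_a for b in group_b)
    return present / (len(group_a) * len(group_b))


def blockmodel(groups, links):
    """Map each pair of groups to True if an intergroup relationship is present."""
    nodes = [n for members in groups.values() for n in members]
    cutoff = overall_density(nodes, links)
    return {
        (p, q): intergroup_density(groups[p], groups[q], links) > cutoff
        for p, q in combinations(sorted(groups), 2)
    }


# Hypothetical partition at one level of abstraction, with six links.
groups = {"G1": ["a", "b"], "G2": ["c", "d"], "G3": ["e", "f"]}
links = {frozenset(p) for p in ["ab", "cd", "ef", "ac", "bc", "ad"]}
model = blockmodel(groups, links)
print(model)
```

In this toy network only G1 and G2 are linked more densely than the network as a whole, so only that pair would be drawn as connected in the reduced view.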

A qualitative study was conducted to evaluate the prototype system. 
We presented the two testing networks to domain experts at TPD and 
received encouraging feedback (Xu & Chen, 2003): 


• Subgroups detected were mostly correct. The domain 
experts checked and validated the members in each 
group. These groups had different characteristics with 
different specialties or crime preferences. We also 
found that although relationships in our network were 
extracted based on crime incidents, they reflected 
relationships between criminals based on friendship, 
kinship, and even conflicts. 


• Centrality measures provided ways of identifying key 
members in a network. According to our domain 
experts, betweenness was a reliable measure to iden- 
tify gatekeepers between subgroups. However, degree 
sometimes misidentified leaders because the 
criminals with the most connections to others may not 
always be the leaders. Leaders may be smart enough 
to hide behind other criminals to avoid police contact. 


• Interaction patterns identified could help reveal 
relationships that previously had been overlooked. Our 
system could generate the “big picture” for a complex 
network. As a result, some relationships between 
criminal groups that had previously been overlooked 
became easier to identify. 


• Saving investigation time. Our domain experts had 
obtained knowledge about the gang and narcotics 
organizations based on several years of work. Using 
information gathered from a large number of arrests 
and interviews, they had built the networks 
incrementally by linking new criminals to known 
gangs in the network and then studying the 
organization of these networks. Because there was no 
structural analysis tool available, they did all of this 
by hand. With the help of our system, they expected 
that substantial time would be saved in network 
creation and structural analysis. 


• Saving training time for new investigators. New 
investigators who did not have sufficient knowledge of 
criminal organizations and individuals could use the 
system to grasp the essence of the network and 
related crime history quickly. They would not have to 
spend as much time studying hundreds of incident 
reports. 


• Helping prove guilt of criminals in court. The 
relationships discovered between individual criminals 
and criminal groups would be helpful for proving guilt 
when presented at court for prosecution. 
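
The experts' observation that betweenness finds gatekeepers while degree can mislead is easy to reproduce on a toy network. The sketch below is an illustration only: it treats links as unweighted, so breadth-first search stands in for the Dijkstra computation mentioned earlier, and the nine-member network is hypothetical.

```python
from collections import deque
from itertools import combinations


def build(edges):
    """Build an undirected adjacency map from a list of node pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def degree(graph):
    return {v: len(nbrs) for v, nbrs in graph.items()}


def betweenness(graph):
    """Brandes-style betweenness for an unweighted, undirected graph."""
    score = dict.fromkeys(graph, 0.0)
    for s in graph:
        dist = {s: 0}
        paths = dict.fromkeys(graph, 0)  # number of shortest paths from s
        paths[s] = 1
        preds = {v: [] for v in graph}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    paths[w] += paths[v]
                    preds[w].append(v)
        delta = dict.fromkeys(graph, 0.0)
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += paths[v] / paths[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each undirected pair was counted once from each endpoint.
    return {v: b / 2 for v, b in score.items()}


# Two tightly knit groups of four, joined only through member "g".
edges = list(combinations("abcd", 2)) + list(combinations("efhi", 2))
edges += [("a", "g"), ("g", "e")]
graph = build(edges)
deg = degree(graph)
btw = betweenness(graph)
print(deg["g"], btw["g"], deg["a"], btw["a"])
```

Here "g" has only two links, fewer than any other member except none, yet it carries the highest betweenness because every path between the two groups passes through it.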


Case Study 6: Analyzing Terrorist Networks 


As part of the worldwide Islamic Jihadist movement, a number of ter- 
rorist organizations have targeted the West. Terrorism and terrorist 
attacks pose severe threats and have caused significant damage world- 
wide. Only with an in-depth understanding of terrorism and terrorist 
organizations can societies defend themselves against the threats. 
Because terrorist organizations often operate in networks through which 
individual terrorists collaborate to carry out attacks (Klerks, 2001; 
Krebs, 2001), network analysis can help uncover valuable information by 
studying the networks’ structural properties (Xu & Chen, in press). We 
have employed techniques and methods from SNA and Web mining to 
address the problem of structural analysis of terrorist networks. 

The objective of this case study was to examine the potential of net- 
work analysis tools for terrorist analysis. By comparing our findings 
with experts’ input we sought to ascertain whether automatic analysis 
of structural properties of a terrorist network would generate informa- 
tion consistent with expert knowledge. 

In this study, we focused on the structural properties of a set of Islamic 
terrorist networks, including Osama bin Laden’s Al Qaeda. In a recently 
published book, Sageman (2004) documented the history and evolution of 
these terrorist organizations, which he terms Global Salafi Jihad (GSJ). 
Sageman is a social psychologist and formerly served as a foreign service 
officer. During the Afghan-Soviet war, from 1986 to 1989, he dealt with 
Islamic fundamentalists on a daily basis and developed substantial exper- 
tise in terrorism and terrorist organizations. Drawing upon various open 
sources, such as news articles and court transcripts, he collected data on 
364 terrorists in the GSJ network regarding their background, religious 
beliefs, social relations, and the terrorist attacks in which they partici- 
pated. There are three types of social relations among these terrorists: 
personal links (e.g., acquaintance, friendship, and kinship), operational 
links (e.g., collaborators in the same attack), and relations formed after 
attacks (Sageman, 2004). Sageman identified four major terrorist groups 
on the basis of their geographical locations: Central Staff, Core Arab, 
Maghreb Arab, and Southeast Asian. Each group has its own leaders. For 
example, Osama bin Laden is the leader of the Central Staff group, which 
connects to the other three groups through several lieutenants. 

We analyzed the GSJ network based on the social relation data con- 
tained in a spreadsheet provided by Sageman. Using the SNA visualiza- 
tion approach, we depicted the GSJ network graphically as shown in 
Figure 6.7. 


• Centrality analysis. Considering all three types of 
social relations, we found that the four group leaders 
were among the 11 most popular members, where 
popularity was represented by degree measure. For 
example, Osama bin Laden had 72 links to other 
terrorists and ranked second in degree. Although he 
was not a leader, Hambali had the highest degree 
score and played an important role in connecting 
different terrorist groups (see Figure 6.7a). Moreover, 
the lieutenants tended to have high scores in 
betweenness and served as gatekeepers between 
groups. The analysis implies that centrality measures 
could be useful for identifying important members of a 
terrorist network. 


(a) Left: The GSJ network with all types of relations. Each 
node represents a terrorist. A link represents a social relation. 
The four terrorist groups are color-coded in the actual 
system: Central Staff, pink; Core Arab, yellow; Maghreb 
Arab, blue; and Southeast Asian, green. Leaders are 
labeled in red and lieutenants are labeled in black. 

(b) Left: The GSJ network with personal links. The blue path 
indicates the hypothesis regarding the connection between bin 
Laden and the 9/11 attacks. 

(c) Right: The GSJ network with operational links. A link 
between two terrorists indicates that they were involved in 
the same attack. Circles of nodes represent specific 
attacks. The circles can also be called cliques where group 
members are densely connected with other group 
members. 

Figure 6.7 The Global Salafi Jihad (GSJ) network. 


• Subgroup analysis. The four terrorist groups depicted 
in Figure 6.7 were color coded in the actual GSJ 
network system following Sageman’s advice. To find out 
whether these geographically based groups were also 
structurally cohesive, we calculated the cohesion score 
(Wasserman & Faust, 1994) of each group. We found 
that all these groups had high cohesion scores. The 
Southeast Asian group scored the highest in cohesion. 
This may suggest that members in this group tended 
to be more closely related to members of their own 
group than to members from other groups. According 
to Sageman, the Southeast Asian group was quite 
different from the other three groups in terms of their 
religious beliefs and missions. 


• Network structure analysis. Sageman had reported 
that these groups had different structures: The 
Southeast Asian group’s structure was hierarchical 
with members at higher levels leading lower-level 
members, whereas the other three groups were 
scale-free networks (Albert & Barabási, 2002). However, we 
found that the four groups were similar in their degree 
distribution, which was a power-law distribution with 
a long tail for large values of degree (see Figure 6.8). 
This implies that all four networks were scale-free, 
with a few important members (nodes with high 
degree scores) dominating the network and new mem- 
bers tending to join through these dominant members. 
This finding has an important policy implication: 
Disruptive strategies should potentially be focused on 
central members in a terrorist network (Strickland, 
2002a, 2002b, 2002c, 2002d, 2002e). 


• Link path analysis. Comparing the personal network 
representation (Figure 6.7b) and the operational 
network representation (Figure 6.7c), we found that 
some important members did not have direct personal 
links to an attack prior to execution. For example, 
neither Osama bin Laden, Khalid Sheikh Mohammed, 
nor Hambali had direct personal links to terrorists in 
the 9/11 attack clique. We performed link path 
analysis to find the shortest paths of personal links 
leading to the 9/11 terrorists. One of our hypotheses 
was that Osama bin Laden connected to the 9/11 clique 
through a four-hop path: bin Laden—Nashiri— 
ZaMihd—Mihdhar—Shibh (the dark path in Figure 
6.7b). Although this hypothesis turned out to be wrong 
according to Sageman’s feedback (other information 
was needed to establish the link), the analysis showed 
the potential of using link path analysis to generate 
hypotheses about the motives and planning processes 
behind terrorist attacks. 
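
Link path analysis of the kind described in the last item amounts to a shortest-path search from one member to any member of a target clique. The sketch below assumes unweighted personal links and uses breadth-first search; the node labels are placeholders and do not correspond to anyone in the GSJ data.

```python
from collections import deque


def build(edges):
    """Build an undirected adjacency map from a list of node pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def shortest_path(graph, source, targets):
    """Breadth-first search: a shortest path from source to any target node."""
    targets = set(targets)
    parent = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node in targets:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nbr in sorted(graph.get(node, ())):
            if nbr not in parent:
                parent[nbr] = node
                queue.append(nbr)
    return None  # no chain of personal links reaches the clique


# Hypothetical personal links; "c1" and "c2" form the target clique.
personal = build([
    ("leader", "m1"), ("m1", "m2"), ("m2", "m3"), ("m3", "c1"),
    ("leader", "m4"), ("m4", "m5"), ("c1", "c2"),
])
path = shortest_path(personal, "leader", {"c1", "c2"})
print(path, "hops:", len(path) - 1)
```

A path returned this way is only a hypothesis about how two parties might be connected; as the Sageman feedback above shows, it still has to be checked against other information.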


Protecting Critical Infrastructure and Key Assets 


The Internet is a critical infrastructure and asset in the information 
age. Cybercriminals have been using various Web-based channels (e.g., 
e-mail, Web sites, Internet newsgroups, chat rooms) to distribute illegal 
materials. One common characteristic of these channels is anonymity. 
People usually do not need to provide information about their real 
identity, such as name, age, gender, and address, in order to participate in 
cyberactivities. Compared with conventional crimes, cybercrime 
conducted through such anonymous channels creates novel challenges for 
researchers and law enforcement agencies engaged in criminal identity 
tracing. The situation is further complicated by the enormous number of 
cyberusers and activities, making the manual approach to criminal 
identity tracing impossible. Law enforcement agencies urgently need 
approaches that automate criminal identity tracing in cyberspace and 
allow investigators to prioritize their tasks and focus on major criminals. 
This case study demonstrates the potential of using authorship analysis 
with carefully selected feature sets and effective classification 
techniques for criminal identity tracing in cyberspace (Zheng et al., 2003). 

Figure 6.8 The power-law degree distribution of the Southeast Asian group. 


Case Study 7: Identity Tracing in Cyberspace 


Data used in this study were from open sources. Three datasets, two 
in English and one in Chinese, were collected. One of the English 
datasets consisted of 153 Usenet newsgroup messages advertising illegal 
sales of pirated CDs and software. We manually identified the nine most 
active users (each represented by a unique ID and e-mail address) who 
posted messages in these newsgroups. The Chinese dataset contained 70 
Bulletin Board System (BBS) messages offering illegal CDs and software 
for sale, downloaded from a popular Chinese BBS. 

The two key techniques used in this study were feature selection and 
classification. The objective was to classify text messages into different 
classes with each class representing one author. Based on a review of 
previous studies on text and e-mail authorship analysis, along with the 
specific characteristics of the messages in our datasets, we selected a 
large number of features that were potentially useful for identifying 
message authors. Three types of features were used: style markers 
(content-free features such as frequency of function words, total number 
of punctuation marks, and average sentence length), structural features 
(such as use of a greeting statement, position of requoted text, and use 
of a farewell statement), and content-specific features (such as 
frequency of keywords, special character of content). 
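
To make the first two feature types concrete, here is a small sketch of how a few style markers and structural features might be computed from a message. The function-word list, the greeting and farewell cues, and the sample message are illustrative assumptions; the study itself used a far larger feature set that is not reproduced here.

```python
import re
import string

# Illustrative subset only; not the study's list of function words.
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in")


def style_markers(message):
    """Content-free features: function-word rates, punctuation, sentence length."""
    words = re.findall(r"[a-z']+", message.lower())
    sentences = [s for s in re.split(r"[.!?]+", message) if s.strip()]
    total = max(len(words), 1)
    markers = {f"freq_{w}": words.count(w) / total for w in FUNCTION_WORDS}
    markers["punctuation_count"] = sum(ch in string.punctuation for ch in message)
    markers["avg_sentence_length"] = len(words) / max(len(sentences), 1)
    return markers


def structural_features(message):
    """Layout features: crude checks for a greeting line and a farewell line."""
    lines = [ln.strip().lower() for ln in message.splitlines() if ln.strip()]
    first = lines[0] if lines else ""
    last = lines[-1] if lines else ""
    return {
        "has_greeting": first.startswith(("hi", "hello", "dear")),
        "has_farewell": last.startswith(("thanks", "regards", "bye")),
    }


message = "Hello all,\nI have the new CDs in stock. Send mail to order!\nThanks"
markers = style_markers(message)
structure = structural_features(message)
print(markers)
print(structure)
```

Each message becomes one feature vector, and the classifiers discussed next are trained on those vectors with the author as the class label.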

For classification analysis, three popular classifiers were selected: 
the C4.5 decision tree algorithm (Quinlan, 1986), backpropagation 
neural networks (Lippmann, 1987), and support vector machines 
(SVM) (Cristianini & Shawe-Taylor, 2000; Hsu & Lin, 2002). Each 
individual classifier had been employed in previous authorship analysis 
research (Diederich, Kindermann, Leopold, & Paass, 2003). In general, 
SVM and neural networks had exhibited better performance than deci- 
sion trees (Diederich et al., 2003). However, most previous authorship 
studies had been based on corpora of newspaper articles such as The 
Federalist Papers. Because online messages are quite different from for- 
mal articles in style, we needed to test the performances of these three 
algorithms on our datasets. 

The experimental procedure was as follows. Three experiments 
were conducted on the newsgroup dataset with one classifier at a time. 
In the first run, 205 style markers (67 for the Chinese BBS dataset) were 
used; nine structural features were added in the second run; and nine 
content-specific features were added in the third run. A 30-fold 
cross-validation testing method was used in all experiments. 
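
For readers unfamiliar with k-fold cross-validation, the following sketch shows how 153 messages could be divided into 30 folds, each held out once for testing. It is an assumed, simplified procedure (random assignment with a fixed seed, no stratification by author) and says nothing about how the folds were actually drawn in the study.

```python
import random


def k_fold_indices(n_items, k, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    order = list(range(n_items))
    random.Random(seed).shuffle(order)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [order[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(153, 30))
print(len(splits), sorted({len(test) for _, test in splits}))
```

Every message is used for testing exactly once and for training in the other 29 rounds, so the reported scores are averages over all 30 held-out folds.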

We used accuracy, recall, and precision to evaluate the prediction per- 
formance of the three classifiers. Accuracy represents the overall predic- 
tion performance of a classifier. For each author, we used precision and 
recall to measure the effectiveness of a classifier. The three measures 
are defined in equations (1)-(3). 


(1) $\text{Accuracy} = \dfrac{\text{Number of messages with author correctly identified}}{\text{Total number of messages}}$ 


(2) $\text{Precision} = \dfrac{\text{Number of messages correctly assigned to the author}}{\text{Total number of messages assigned to the author}}$ 


(3) $\text{Recall} = \dfrac{\text{Number of messages correctly assigned to the author}}{\text{Total number of messages written by the author}}$ 
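
The three measures translate directly into code. The sketch below computes overall accuracy and per-author precision and recall from parallel lists of true and predicted authors; the six-message example and the author labels are made up for illustration.

```python
def evaluate(true_authors, predicted_authors):
    """Return overall accuracy and per-author precision and recall."""
    pairs = list(zip(true_authors, predicted_authors))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    per_author = {}
    for author in set(true_authors) | set(predicted_authors):
        correct = sum(t == p == author for t, p in pairs)
        assigned = sum(p == author for _, p in pairs)   # predicted as this author
        written = sum(t == author for t, _ in pairs)    # truly by this author
        per_author[author] = {
            "precision": correct / assigned if assigned else 0.0,
            "recall": correct / written if written else 0.0,
        }
    return accuracy, per_author


true_authors = ["A", "A", "A", "B", "B", "C"]
predicted = ["A", "A", "B", "B", "B", "C"]
accuracy, per_author = evaluate(true_authors, predicted)
print(accuracy, per_author)
```

In this example one of author A's messages is wrongly assigned to B, which lowers A's recall and B's precision while leaving A's precision and B's recall at 1.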


We summarize the results as follows: 


• SVM and neural networks outperformed the C4.5 
decision tree algorithm. For example, when style 
markers were applied to the e-mail dataset, 
C4.5, neural networks, and SVM achieved accuracies 
of 74.29 percent, 81.11 percent, and 82.86 percent, 
respectively. SVM also consistently achieved higher 
accuracy, precision, and recall than the neural 
networks. However, the performance differences 
between SVM and neural networks were relatively 
small. Our results were generally consistent with 
previous studies, in that neural networks and SVM 
typically achieve better performance than decision tree 
algorithms (Diederich et al., 2003). 


• Use of style markers and structural features 
outperformed use of style markers only. We achieved 
significantly higher accuracy levels for all three 
datasets (p-values were below 0.05) by adopting the 
structural features. This possibly resulted from an 
author’s consistent writing patterns being evident in 
the message’s structural features. 


• Use of style markers, structural features, and 
content-specific features did not achieve better 
performance than use of style markers and structural 
features. The results indicated that using content- 
specific features as additional features did not improve 
the authorship prediction performance significantly 
(with p-value of 0.3086). We thought this was because 
authors of illegal messages typically included diverse 
content in their messages and little additional 
information could be derived from the message content 
to determine authorship. We also observed that high 
levels of accuracy were obtained when style markers 
alone were used as input features for the English 
datasets. The accuracy level ranged from 71 to 89 
percent. The results indicated that style markers alone 
contain a large amount of information about people’s 
online message writing styles and are surprisingly 
robust in predicting the authorship. 


• There was a significant drop in prediction performance 
measures for the Chinese BBS dataset in comparison to 
the English datasets. For example, when using style 
markers only, C4.5 achieved average accuracies of 
86.28 and 74.29 percent for the English newsgroup 
and e-mail datasets, whereas for the Chinese dataset, 
it achieved an average accuracy of only 54.83 percent. 
A possible reason was that only 67 Chinese style mark- 
ers were used in the experiments, significantly fewer 
than the 205 style markers used with the English 
dataset. We expect to achieve higher prediction perfor- 
mances if additional Chinese style markers are identi- 
fied and included. We also observed that when 
structural features were added, all three algorithms 
achieved relatively high precision, recall, and accuracy 
(from 71 to 83 percent) for the Chinese dataset. 
Considering the significant language differences, our 
proposed approach to the problem of online message 
identity tracing appears promising in a multilingual 
context. 


Just as a “finger-print” or “voice-print” can help identify a 
person, we believe that there is a need and potential for developing a 
robust multilingual “write-print” model based on an individual’s unique 
writing style. Such a model, possibly building on research in stylometrics 
(Williams, 1975), would have strong value for cybercrime investigation. 


Emergency Preparedness and Responses 


Terrorist attacks can cause devastating damage to a society through 
the use of chemical, biological, or radiological weapons. Currently, a large 
amount of infectious disease data is being collected by various laborato- 
ries, health care providers, and government agencies at local, state, 
national, and international levels (Pinner, Rebmann, Schuchat, & 
Hughes, 2003). However, access to some of these data sources and related 
search and reporting functionalities may be limited to the agencies that 
have developed such systems (Kay, Timperi, Morse, Forslund, 
McGowan, & O’Brien, 1998), reducing the effective use of infectious dis- 
ease data in national and global contexts. In addition, real-time data 
sharing, especially of databases across species and jurisdictions, could 
enhance expert scientific review and rapid response using input and 
action triggers provided by multiple government and public health part- 
ners. In this case study we discuss our ongoing research and system 
development efforts designed to address some of these challenges. We 
aim to develop scalable technologies and related standards and protocols 
needed for a national infectious disease information infrastructure 
(Zeng, Chen, Tseng, Larson, Eidson, Gotham, et al., 2004). 


Case Study 8: The WNV-BOT Portal 


Our research focuses on two prominent infectious diseases: West Nile 
Virus (WNV) and Botulism. These two diseases were chosen because of 
their significant public health and national security implications and 
the availability of related datasets for the states of New York and 
California. We developed a research prototype called the WNV-BOT 
Portal system, which provides integrated, Web-enabled access to a 
variety of distributed data sources including the New York State 
Department of Health (NYSDOH), the California Department of Health 
Services (CADHS), and other federal sources (e.g., the United States 
Geological Survey [USGS]). It also provides advanced information 
visualization capabilities as well as predictive modeling support. 

Architecturally, the WNV-BOT Portal consists of three major compo- 
nents: a Web portal, a data store, and a communication backbone. The 
Web portal implements the user interface and provides the following 
main functionalities: (1) searching and querying available WNV/BOT 
datasets, (2) visualizing WNV/BOT datasets using spatial-temporal 
visualization, (3) accessing analysis and prediction functions, and (4) 
accessing the alerting mechanism. 

To enable data interoperability, we use Health Level Seven (HL7) 
standards (http://www.hl7.org) as the main storage format. In our data 
warehousing approach, contributing data providers transmit data to the 
WNV-BOT Portal as HL7-compliant XML messages (through a secure 
network connection if necessary). After receiving these XML messages, 
the WNV-BOT Portal adds them directly to its data store. To alleviate 
potential computational performance problems associated with this HL7 
XML-based approach, we have identified a core set of data fields, on 
which searches could be performed efficiently. 
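
One way to picture the core-field idea is a store that keeps each XML message intact while indexing a handful of extracted fields for fast lookup. The sketch below is purely illustrative: the element names (`disease`, `county`, `onset_date`) are hypothetical placeholders and not taken from the HL7 schema, and the in-memory list and dictionary stand in for whatever data store the portal actually uses.

```python
import xml.etree.ElementTree as ET

# Hypothetical core fields; these are NOT real HL7 element names.
CORE_FIELDS = ("disease", "county", "onset_date")


def ingest(message_xml, store, index):
    """Store the raw XML unchanged and index its core fields."""
    root = ET.fromstring(message_xml)
    record_id = len(store)
    store.append(message_xml)
    for field in CORE_FIELDS:
        value = root.findtext(f".//{field}")
        if value is not None:
            index.setdefault((field, value.strip()), []).append(record_id)
    return record_id


def search(index, store, field, value):
    """Answer a core-field query from the index without re-parsing any XML."""
    return [store[i] for i in index.get((field, value), [])]


store, index = [], {}
ingest("<case><disease>WNV</disease><county>Albany</county>"
       "<onset_date>2003-08-14</onset_date></case>", store, index)
ingest("<case><disease>Botulism</disease><county>Kern</county></case>", store, index)
hits = search(index, store, "disease", "WNV")
print(len(hits))
```

Queries on indexed fields avoid scanning and parsing every stored message, which is the performance concern the paragraph above raises.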

An important function of the data store layer is data ingest and access 
control. The data ingest control module is responsible for checking the 
integrity and authenticity of data feeds from the underlying information 
sources. The access control module is responsible for granting and 
restricting user access to sensitive data. 

The communication backbone component enables data exchanges 
between the WNV-BOT Portal and the underlying WNV/BOT sources 
based upon the CDC’s (Centers for Disease Control and Prevention) 
National Electronic Disease Surveillance System (NEDSS) and HL7 standards. It 
uses a collection of source-specific “connectors” to communicate with 
underlying sources. We use the connector linking NYSDOH’s Health 
Information Network (HIN) system and WNV-BOT Portal to illustrate a 
typical design of such connectors. The data sent from HIN to the portal 
system are transmitted in a “push” manner. HIN sends secure Public 
Health Information Network Messaging System (PHIN MS) messages to 
the portal at prespecified time intervals. The connector at the portal side 
runs a data receiver daemon listening for incoming messages. After a 
message is received, the connector checks for data integrity syntactically 
and invokes the data normalization subroutine. Then the connector 
stores the verified message in the portal’s internal data store through its 
data ingest control module. Other data sources (e.g., those from USGS) 
may have “pull-” type connectors, which periodically download informa- 
tion from the source Web sites and examine and store data in the por- 
tal’s internal data store. In general, the communication backbone 
component provides data receiving and sending functionalities, source- 
specific data normalization, as well as data encryption capabilities. 
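
The receive, check, normalize, and store sequence of a "push" connector can be sketched as a single message handler. This is an assumed simplification: the receiver daemon, PHIN MS transport, and encryption are omitted, the syntactic check is just XML well-formedness, and the normalization rule and element names are hypothetical.

```python
import xml.etree.ElementTree as ET


def check_syntax(raw):
    """Syntactic integrity check: return the parsed root, or None if malformed."""
    try:
        return ET.fromstring(raw)
    except ET.ParseError:
        return None


def normalize(root):
    """Hypothetical source-specific normalization of one message."""
    record = {child.tag: (child.text or "").strip() for child in root}
    if "disease" in record:
        record["disease"] = record["disease"].upper()
    return record


def handle_message(raw, data_store, rejected):
    """Process one incoming message; return True if it was stored."""
    root = check_syntax(raw)
    if root is None:
        rejected.append(raw)
        return False
    data_store.append(normalize(root))
    return True


data_store, rejected = [], []
ok = handle_message("<case><disease> wnv </disease><county>Albany</county></case>",
                    data_store, rejected)
bad = handle_message("<case><disease>wnv</case>", data_store, rejected)
print(ok, bad, data_store)
```

A "pull" connector would differ only in how the raw message arrives: it would fetch data on a schedule and then pass each item through the same handler.
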

The WNV-BOT Portal makes available the Spatial Temporal 
Visualizer (STV) (Buetow et al., 2003) to facilitate exploration of 
infectious disease case data and to summarize query results. STV has three 
integrated and synchronized views: periodic, timeline, and GIS. Figure 
6.9 illustrates how these three views can be used to explore the 
infectious disease dataset. The top-left panel shows the GIS view. The user 
can select multiple datasets to be shown on the map in a layered 
manner using the checkboxes. The top-right panel corresponds to the 
timeline view displaying the occurrences of various cases using a Gantt 
chart-like display. The user can also access case details easily by using 
the tree display located left of the timeline display. Below the timeline 
view is the periodic view through which the user can identify periodic 
temporal patterns (e.g., which months have an unusually high number 
of cases). The bottom portion of the interface allows the user to specify 
subsets of data to be displayed and analyzed. 

Figure 6.9 Using STV to visualize botulism data. 
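
The aggregation behind a periodic view can be approximated by counting cases per calendar month across years and flagging months that stand out. The sketch below is not STV code; the "unusually high" rule (more than twice the monthly mean) and the case dates are assumptions chosen for illustration.

```python
from collections import Counter
from datetime import date


def cases_by_month(case_dates):
    """Count cases per calendar month, pooling all years together."""
    counts = Counter(d.month for d in case_dates)
    return [counts.get(m, 0) for m in range(1, 13)]


def unusual_months(monthly, factor=2.0):
    """Months whose count exceeds `factor` times the mean monthly count."""
    mean = sum(monthly) / 12
    return [m for m, n in enumerate(monthly, start=1) if n > factor * mean]


# Hypothetical case onset dates.
case_dates = [date(2002, 1, 5), date(2003, 1, 9), date(2002, 9, 3)]
case_dates += [date(2002, 8, day) for day in range(1, 10)]
monthly = cases_by_month(case_dates)
flagged = unusual_months(monthly)
print(monthly, flagged)
```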

Our project has supported exploration of, and experimentation with, 
technological infrastructures needed for a full-fledged implementation of 
a national infectious disease information infrastructure and has helped 
foster information sharing and collaboration among related government 
agencies at state and federal levels. In addition, we have obtained 
important insights into, and hands-on experience with, various 
policy-related challenges faced in developing a national 
infrastructure. For example, a nontrivial part of our project activity has been 
devoted to developing data-sharing agreements between project part- 
ners from different states. 

Our ongoing technical research is focusing on two aspects of infec- 
tious disease informatics: hotspot analysis and efficient alerting and dis- 
semination. For WNV, localized clusters of dead birds typically identify 
high-risk disease areas. Automatic detection of dead bird clusters using 
hotspot analysis can help predict disease outbreaks and allocate pre- 
vention/control resources effectively. Initial experimental results indi- 
cate that these techniques are promising for disease informatics 
analysis. We are planning to augment existing predictive models by con- 
sidering additional environmental factors (e.g., weather information, 
bird migration patterns), and tailoring data mining techniques for 
infectious disease datasets that have prominent temporal features. 
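
As a very rough illustration of what detecting dead bird clusters involves, the sketch below bins sighting coordinates into a square grid and reports cells whose count reaches a threshold. This simple grid count is a stand-in chosen for brevity, not the hotspot analysis technique used in the project; the cell size, threshold, and coordinates are arbitrary assumptions.

```python
from collections import Counter


def hotspots(sightings, cell_size, min_count):
    """Return grid cells containing at least `min_count` sightings."""
    cells = Counter(
        (int(x // cell_size), int(y // cell_size)) for x, y in sightings
    )
    return sorted(cell for cell, n in cells.items() if n >= min_count)


# Hypothetical dead bird sightings as (x, y) positions in kilometers.
sightings = [(0.1, 0.2), (0.4, 0.3), (0.8, 0.9), (5.2, 5.1), (9.7, 1.1)]
cells = hotspots(sightings, cell_size=1.0, min_count=3)
print(cells)
```

A fixed grid is sensitive to where cell boundaries fall, which is one reason dedicated hotspot methods are preferred in practice.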


Case Study Summary 


We summarize in Table 6.7 the eight case studies in terms of their 
data characteristics, technologies employed, and the national security 
missions they addressed using our proposed ISI research framework. 


Table 6.7 Summary of ISI case studies 

Case Study | Project | Data Characteristics | Technologies Used | Critical Mission Area Addressed 
1 | Deception detection | [illegible in source]; identity records | Association mining; similarity-based [illegible] | Intelligence and warning 
2 | Dark Web Portal | Open source; Web hyperlink data | Cluster analysis; visualization | Intelligence and warning 
3 | BorderSafe | Authoritative source; structured data | Information sharing and integration; database federation | Border and transportation security 
4 | COPLINK Detect | Authoritative source; structured data | Association mining; statistical-based [illegible] | Domestic counter-terrorism 
5 | Criminal network analysis | Authoritative source; structured data | Social network analysis; cluster analysis; visualization | Domestic counter-terrorism 
6 | Terrorist network analysis | Open source; [illegible] data, structured | Intelligence text mining; social network analysis | Domestic counter-terrorism 
7 | Identity tracing in cyberspace | Open source; structured data | Intelligence text mining; classification | Protecting critical infrastructure and key assets 
8 | WNV-BOT Portal | Authoritative source; structured data | Information sharing and integration; spatial and temporal visualization | Emergency preparedness and responses 


The ISI Partnership Framework 


In order to accomplish the six critical mission areas of national 
security, the Department of Homeland Security has proposed establishing a 
network of laboratories consisting of satellite research centers across the 
nation (U.S. Office of Homeland Security, 2002). The purpose is to create 
a multidisciplinary environment for developing technologies to counter 
various threats to homeland security. However, information sharing and 
collaboration across different jurisdictions, agencies, and research 
institutes is not merely a technical issue. A variety of social, organizational, 
and political barriers need to be addressed, including: 


• Security and confidentiality. In the intelligence and 
law enforcement domain, security is of great concern. 
Data regarding crimes, criminals, terrorist organiza- 
tions, and potential terrorist attacks may be highly 
sensitive and confidential in nature. Improper use of 
data could lead to fatal consequences. 


• Trust and willingness to share information. Different 
agencies may not be motivated to share information 
and collaborate if there is no immediate gain. They 
may also fear that information being shared will be 
misused, resulting in legal liabilities. 


• Data ownership and access control. The questions that 
need to be addressed are: Who owns a particular data 
set? Who is allowed to access, aggregate, or input 
data? Who owns the derivative data (knowledge)? For 
both original and derivative data, who is allowed to 
distribute them to whom? 


The COPLINK Center at the Artificial Intelligence Lab of the 
University of Arizona, as a leading research center for law enforcement 
and intelligence information and knowledge management, intends to 
become a part of the national network of laboratories. During its devel- 
opment over the past decade the COPLINK Center has encountered 
many of these non-technical challenges in its partnerships with various 
law enforcement and federal agencies. In this section, we summarize 
some of our experiences and lessons learned. 


Ensuring Data Security and Confidentiality 


In any data sharing initiative, it is essential to make sure that the 
data shared between agencies are secure and that the privacy of indi- 
viduals is respected. In our research we have taken the necessary mea- 
sures to ensure data privacy, security, and confidentiality. Data shared 
among law enforcement agencies, such as TPD, PPD, and CBP, con- 
tained only law enforcement data and were available only to individuals 
screened by these agencies using a combination of TPD Background 
Check, Employee Non-Disclosure Agreement (NDA), and the Terminal 
Operator Certificate (TOC) test. 

All personnel who have access to law enforcement data fill out back- 
ground forms provided by TPD and have their fingerprints taken at 
TPD. They also sign a nondisclosure agreement provided by TPD. In 
addition, they take the TOC test every year. The background informa- 
tion and fingerprints are then checked by TPD investigators to ensure 
lack of involvement in criminal activity and to verify identity. 

In addition to these forms and the TOC test, all law enforcement data in the 
University of Arizona COPLINK Center reside behind a firewall and in 
a secure room accessible only by activated cards to those who have met 
the security criteria. As soon as an employee stops working on projects 
related to law enforcement data, his or her card is deactivated. However, 
the NDA is perpetual and remains in effect even after a researcher or 
employee leaves. These requirements are similar to those imposed upon 
noncommissioned civilian personnel in a police department. 
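
The screening and card-deactivation rules described above can be read as a
simple access check. The following is a minimal sketch of that reading, not
the Center's actual system: the class, field, and function names are
hypothetical, and "every year" is assumed to mean that a TOC pass is valid
for 365 days.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

# Assumption: the annual TOC test is treated as valid for 365 days.
TOC_VALIDITY = timedelta(days=365)


@dataclass
class Researcher:
    """Hypothetical record of one person's screening status."""
    background_cleared: bool          # background forms and fingerprints checked
    nda_signed: bool                  # nondisclosure agreement on file
    last_toc_pass: Optional[date]     # date of most recent passed TOC test
    on_law_enforcement_project: bool  # currently working with law enforcement data


def card_should_be_active(person: Researcher, today: date) -> bool:
    """Return True only if every screening criterion is currently met."""
    # The card is deactivated as soon as project work stops.
    if not person.on_law_enforcement_project:
        return False
    if not (person.background_cleared and person.nda_signed):
        return False
    if person.last_toc_pass is None:
        return False
    return today - person.last_toc_pass <= TOC_VALIDITY
```

Note that the NDA is deliberately not part of the deactivation logic: it
remains in effect after access ends, so only the card state changes.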


Reaching Agreements Among Partners 


Federal, state, and local regulations require that agreements between 
agencies within their respective jurisdictions receive advance approval 
from their governing hierarchy. This precludes informal information 
sharing agreements between those agencies. We found that require- 
ments varied from agency to agency according to the statutes by which 
they were governed. 

For instance, the ordinances governing information sharing by the city 
of Tucson differed somewhat from those governing the city of Phoenix. 
This necessitated numerous rounds of review of the proposed documents 
by each city’s law enforcement and legal staffs before a final draft could 
be settled upon for approval by the city councils. We found that similar 
language existed in the ordinances and statutes governing this exchange, 
but that the processes varied significantly. It appears that the level of 
bureaucracy is proportional to the size of the jurisdiction. 

TPD has recently developed a generic Inter-Governmental Agreement 
(IGA) that could be adopted between different law enforcement agencies. 
This IGA was condensed from memoranda of understanding (MOUs), 
policies, and agreements that previously existed in various forms 
between numerous agencies. The IGA was drafted to be generic, includ- 
ing language from those laws but excluding reference to any particular 
chapter or section. This allowed the required verbiage to exist in the doc- 
ument without being specific to any jurisdiction. 

Sharing information between agencies with disparate information sys- 
tems has also led to the bridging of boundaries between software vendors 
and agencies (their customers). We took care not to violate licensing 
terms by ensuring that nondisclosure agreements existed and that con- 
tract language assured compliance with the vendors’ licensing policies. 

We believe MOUs and IGAs can be used as templates of information 
sharing agreements and contracts, and can serve as components of an 
ISI partnership framework. We plan to provide free access to these legal 
agreement templates to help facilitate the process of information shar- 
ing and collaboration across agencies and research institutions in the 
future. 


Conclusions and Future Directions 


In this chapter we have discussed the technical issues related to intel- 
ligence and security informatics research, which supports accomplish- 
ment of the critical missions of national security. We have proposed a 
research framework addressing the technical challenges facing counter- 
terrorism and crime-fighting applications, with a primary focus on 
knowledge discovery from databases (KDD). We have identified and 
incorporated into the framework six classes of ISI technologies: infor- 
mation sharing and collaboration, crime association mining, crime clas- 
sification and clustering, intelligence text mining, spatial and temporal 
analysis of crime patterns, and criminal network analysis. We have also 
presented a set of COPLINK case studies, ranging from the detection of 
criminal identity deception to an intelligent Web portal for monitoring 
terrorist Web sites, thus demonstrating the potential of ISI technologies 
for contributing to the critical missions of national security. 

As this new ISI domain continues to evolve, several important direc- 
tions need to be pursued, including technology development; testbed cre- 
ation; and social, organizational, and policy studies: 


• New technologies need to be developed and many 
existing information technologies should be 
re-examined and adapted for national security 
applications. The knowledge discovery perspective 
provides a promising direction. However, new 
technologies should be developed in a legal and ethical 
framework that does not compromise the privacy or 
civil liberties of private citizens. 


• Large-scale, nonsensitive data testbeds that 
incorporate data from diverse, authoritative, and open 
sources and in different formats should be created and 
made available to the ISI research community. Lack of 
real data has been a long-standing problem in 
intelligence- and security-related research. Many 
researchers are forced to use simulated or synthetic 
data that may not resemble actual crime data 
characteristics. Furthermore, comparing competing 
technical approaches has been difficult because of the 
lack of standard test collections. A comprehensive and 
nonsensitive open source data collection, analogous to 
the Message Understanding Conference collection, 
would be of great value for ISI researchers to 
experiment, test, and evaluate various technologies 
and to compare and share findings, insights, and 
knowledge. Advanced methods may need to be 
employed to scrub data contained in the non-open 
source testbed to ensure data confidentiality while 
preserving its characteristics and underlying 
structures. 
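
One simple way to illustrate scrubbing that preserves underlying structure is
keyed pseudonymization: each sensitive value is replaced by a token derived
from a secret key, so that repeated occurrences of the same person still link
together across records. The sketch below is an illustrative assumption, not
a method proposed in this chapter; the function names, the "name" field, and
the sample records are hypothetical. Keyed hashing alone does not guarantee
confidentiality, because remaining attributes may still allow
re-identification.

```python
import hashlib
import hmac


def pseudonymize(value: str, secret_key: bytes) -> str:
    """Map a sensitive value to a stable token using HMAC-SHA256."""
    # Normalize so trivial variations in case or spacing map to one token.
    normalized = value.strip().lower()
    digest = hmac.new(secret_key, normalized.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def scrub_records(records, secret_key, sensitive_fields=("name",)):
    """Return copies of the records with sensitive fields replaced by tokens."""
    scrubbed = []
    for record in records:
        clean = dict(record)  # leave the original record untouched
        for field in sensitive_fields:
            if field in clean:
                clean[field] = pseudonymize(clean[field], secret_key)
        scrubbed.append(clean)
    return scrubbed


# Hypothetical sample data: two incidents involving the same person.
sample = [
    {"name": "John Doe", "incident": "burglary"},
    {"name": "john doe ", "incident": "auto theft"},
    {"name": "Jane Roe", "incident": "burglary"},
]
result = scrub_records(sample, secret_key=b"demo-key-not-for-real-use")
```

In this sketch the first two records receive the same token, so association
and network analyses can still link them, while the third receives a
different one. The key must be kept secret by the data owner; anyone holding
it can recompute the tokens for guessed names.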


The ultimate goal of ISI research is to enhance national security. 
However, the question of how this type of research has had and will have an 
impact on society, organizations, and the general public remains unan- 
swered. Researchers from sociology, political science, organizational and 
management sciences, psychology, and education can contribute sub- 
stantially to this task. 

We hope that active ISI research will help improve knowledge discov- 
ery and dissemination; enhance information sharing and collaboration 
among academics, industry, and local, state, and federal agencies; and 
thereby promote positive societal outcomes. 